Abstract: In the framework of prediction of individual sequences,
sequential prediction methods are to be constructed
that perform nearly as well as the best expert from a given class. 
We consider prediction strategies that compete with the class 
of switching strategies that can segment a given sequence into 
several blocks, and follow the advice of a different "base" expert 
in each block. As usual, the performance of the algorithm is 
measured by the regret defined as the excess loss relative to the 
best switching strategy selected in hindsight for the particular 
sequence to be predicted. In this paper we construct prediction 
strategies of low computational cost for the case where the set 
of base experts is large. In particular we provide a method that 
can transform any prediction algorithm A that is designed for 
the base class into a tracking algorithm. The resulting tracking 
algorithm can take advantage of the prediction performance and 
potential computational efficiency of A in the sense that it can 
be implemented with time and space complexity only $O(n^{\gamma} \ln n)$
times larger than that of A, where n is the time horizon and
$\gamma \ge 0$ is a parameter of the algorithm. With A properly chosen,
our algorithm achieves a regret bound of optimal order for $\gamma > 0$,
and only $O(\ln n)$ times larger than the optimal order for $\gamma = 0$
for all typical regret bound types we examined. For example, for
predicting binary sequences with switching parameters under the
logarithmic loss, our method achieves the optimal $O(\ln n)$ regret
rate with time complexity $O(n^{1+\gamma} \ln n)$ for any $\gamma \in (0, 1)$.



I. Introduction 

In the on-line (sequential) decision problems considered in 
this paper, a decision maker (or forecaster) chooses, at each 
time instant $t = 1, 2, \ldots$, an action from a set. After each
action taken, the decision maker suffers some loss based on the 
state of the environment and the chosen decision. The general 
goal of the forecaster is to minimize its cumulative loss. 
Specifically, the forecaster's aim is to achieve a cumulative 
loss that is not much larger than that of the best expert 
(forecaster) in a reference class $\mathcal{E}$, from which the best expert
is chosen in hindsight. This problem is known as "prediction
with expert advice." The maximum excess loss $R_n$ of the
forecaster relative to the best expert is called the (worst-case)
cumulative regret, where the maximum is taken over
all possible behaviors of the environment and n denotes the
time horizon of the problem. Several methods are known
that can compete successfully with different expert classes
in the sense that the regret only grows sub-linearly, that is,
$\lim_{n\to\infty} R_n/n = 0$. We refer to [1] for a survey.

While the goal in the standard online prediction problem is 
to perform nearly as well as the best expert in the class $\mathcal{E}$,
a more ambitious goal is to compete with the best sequence
of expert predictions that may switch its experts a certain,
limited, number of times. This seemingly more complex
problem may be regarded as a special case of the standard
setup by introducing the so-called meta experts. A meta expert
is described by a sequence of base experts $(i_1, \ldots, i_n) \in \mathcal{E}^n$
such that at time instants $t = 1, \ldots, n$ the meta expert follows
the predictions of the "base" expert $i_t \in \mathcal{E}$ by predicting $f_{i_t,t}$.
The complexity of such a meta expert may be measured by
$C = |\{t \in \{1, 2, \ldots, n-1\} : i_t \neq i_{t+1}\}|$, the number
of times it changes the base predictor (each such change is
called a switch). Note that C switches partition $\{1, \ldots, n\}$
into C + 1 contiguous segments, on each of which the meta
expert follows the predictions of the same base expert. If a
maximum of m changes are allowed and the set of base experts
has N elements, then the class of meta experts is of size
$\sum_{j=0}^{m} \binom{n-1}{j} N (N-1)^j$. Since the computational complexity
of basic prediction algorithms, such as the exponentially
weighted average forecaster, scales with the number of experts, 
a naive implementation of these algorithms is not feasible in 
this case. However, several more efficient algorithms have been 
proposed. 
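To get a feel for how quickly the meta-expert class grows, the count can be evaluated directly. The sketch below assumes the class size is $\sum_{j=0}^{m} \binom{n-1}{j} N (N-1)^j$ (choose the switch positions, then the first expert, then a different expert at every switch); the function names are illustrative and not taken from the paper.

```python
from itertools import product
from math import comb


def meta_expert_class_size(n: int, n_experts: int, max_switches: int) -> int:
    """Count expert sequences of length n with at most max_switches switches."""
    return sum(
        comb(n - 1, j) * n_experts * (n_experts - 1) ** j
        for j in range(max_switches + 1)
    )


def brute_force_size(n: int, n_experts: int, max_switches: int) -> int:
    """Same count by explicit enumeration (only feasible for tiny n)."""
    count = 0
    for seq in product(range(n_experts), repeat=n):
        switches = sum(seq[t] != seq[t + 1] for t in range(n - 1))
        if switches <= max_switches:
            count += 1
    return count


if __name__ == "__main__":
    # Even a modest problem yields far too many meta experts to enumerate.
    print(meta_expert_class_size(n=1000, n_experts=10, max_switches=5))
```

With n = 1000, N = 10 and m = 5 the count is already astronomically large, which is why a forecaster that keeps one weight per meta expert is not an option.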

One approach, widely used in the information theory/source 
coding literature, is based on transition diagrams [2], [3]: A
transition diagram is used to define a prior distribution on the 
switches of the experts, and the starting point of the current 
segment is estimated using this prior. A transition diagram de- 
fines a Markovian model on the switching times: a state of the 
model describes the "status" of a switch process (correspond- 
ing to, e.g., the time when the last switch occurred and the 
actual time), and the transition diagram defines the transition 
probabilities among these states. In its straightforward version, 
at each time instant t, the performance of an expert algorithm 
is emulated for all possible segment starting points $1, \ldots, t$,
and a weighted average of the resulting estimates is used to 
form the next prediction. In effect, this method converts an 
efficient algorithm to compete with the best expert in a class 
$\mathcal{E}$ into one that competes with the best sequence of experts
with a limited number of changes. The time complexity of
the method depends on how complex the prior distribution 
is, which determines the amount of computation necessary to 
update the weights in the estimate. Note that a general prior 
distribution would require exponential computational complex- 
ity in the sequence length, while at each time instant the 
transition diagram model requires computations proportional 
to the number of achievable states at that time instant. Using 
a state space that describes the actual time, the time of the 
last switch, and the number of switches so far, [2] provided
a prediction scheme achieving the optimal regret up to an
additive constant (for the logarithmic loss), and, omitting the
number of switches from the states, a prediction algorithm
with optimal regret rate was provided. [3] showed (also for the
logarithmic loss) that the transition probabilities in the latter
model can be selected so that the resulting prediction scheme
achieves the optimal regret rate with the best possible leading
constant, and the distributions they use allow computing the
weights at time instant t with $O(t)$ complexity. As a result,
in n time steps, the time complexity of the best
transition-diagram-based algorithm is a factor of $O(n)$ larger than
that of the original algorithm that competes with $\mathcal{E}$, yielding
a total complexity that is quadratic in n.

For the same problem, a method of linear complexity was
developed in [4]. It was shown in [5] that this method is
equivalent to an easy-to-implement weighting of the paths in
the full transition diagram. Although, unlike transition-diagram-based
methods, the original version of the algorithm of [4]
requires an a priori known upper bound on the number of
switches, the algorithm can be modified to compete with
meta experts with an arbitrary number of switches: a linear
complexity variant achieves this goal (by letting its switching
parameter $\alpha$ decrease to zero) at the price of somewhat
increasing the regret. A slightly better regret bound can be
achieved for the case when switching occurs more often at the
price of increasing the computational complexity from linear
to $O(n^{3/2})$ [7], [8] (by discretizing its switching parameter $\alpha$
to $\sqrt{n}$ levels).

In another approach, reduced transition diagrams have been 
used for the logarithmic loss (i.e., lossless data compression) 
by [9] and by [3] (the latter work considers a probabilistic
setup as opposed to the individual sequence setting). Reduced 
transition diagrams are obtained by restricting some transi- 
tions, and consequently, excluding some states from the orig- 
inal transition diagram, resulting in (computationally) simpler 
models that, however, have less descriptive power to represent 
switches. An efficient algorithm based on a reduced transition 
diagram for the general tracking problem was given in [10],
while [11] developed independently a similar algorithm to
minimize adaptive regret, which is the maximal worst-case 
cumulative excess loss over any contiguous time segment 
relative to a constant expert. It is easy to see that algorithms 
with good adaptive regret also yield good tracking regret. 

An important question is how one can compete with meta 
experts when the base expert class $\mathcal{E}$ is very large. In such
cases special algorithms are needed to compete with experts
from the base class even without switching. Such large
base classes arise in on-line linear optimization [12], lossless
data compression [13]-[15], the shortest path problem [16],
[17], or limited-delay lossy data compression [18]-[20]. Such
special algorithms can easily be incorporated in
transition-diagram-based tracking methods, but the resulting complexity
is quadratic in n (see, e.g., [3] for such an application to
lossless data compression or [21]-[23] for applications to
signal processing and universal portfolio selection). If the
special algorithms for large base expert classes are combined
with the algorithm of [4] to compete with meta experts, the
resulting algorithms again have quadratic complexity in n; see,
e.g., [5], [24] (the main reason for this is that the special
implementation tricks used for the large base expert classes,
such as dynamic programming, are incompatible with the
efficient implementation of the algorithm of [4] for switching
experts). The only example we are aware of where efficient
tracking algorithms with linear time complexity are available
for a meaningful, large class of base experts is the case of
online convex programming, where the set of base experts
is a finite dimensional convex set and the (time-varying)
loss functions are convex [25] (see also the related problem
of tracking linear predictors [26]). In this case projected
gradient methods (including exponentially weighted average
prediction) lead to tracking regret bounds of optimal order.
Note that instead of the number of switches, these bounds
measure the complexity of the meta experts with the more
refined notion of $L_p$ norms.

In this paper we tackle the complexity issue in competing 
with meta-experts for large base expert classes by presenting 
a general method for designing reduced transition diagrams. 
The resulting algorithm converts any (black-box) prediction 
algorithm A achieving good regret against the base-expert 
class into one that achieves good tracking and adaptive regret. 
The advantage of this transition-diagram based approach is 
that the conversion is independent of the base prediction 
algorithm A, and so some favorable properties of A are 
automatically transferred to our algorithm. In particular, the 
complexity of our method depends on the base-expert class 
only through the base prediction algorithm A, thus exploiting 
its potential computational efficiency.$^1$ Our algorithm unifies
and generalizes the algorithms of [9], [11] and our earlier work
[10]. This algorithm has an explicit complexity-regret trade-off,
covering essentially all such results in the literature. In
addition to the (almost) linear complexity algorithms in the
aforementioned papers, the parameters of our algorithm can
be set to reproduce the methods based on the full transition
diagram [2], [3], [21], or the complexity-regret behavior of
[7], [8]. Also, our algorithm has regret of optimal order with
complexity $O(n^{1+\gamma} \ln n)$ for any $\gamma \in (0, 1)$, while setting
$\gamma = 0$ results in complexity $O(n \ln n)$ and a regret rate that is
only a factor of $\ln n$ larger than the optimal one (similarly to

The rest of the paper is organized as follows. First the 
online prediction and the tracking problems are introduced 
in Section II. In Section III-A we describe our general algorithm.
Sections III-B and III-C present a unified method for

$^1$ Other black-box reductions of forecasters for different notions of regret are
available in the literature; for example, the conversion of forecasters achieving
good external regret to ones achieving good internal regret [27], [28].
PREDICTION WITH EXPERT ADVICE
For each round t = 1, 2, ...

(1) the environment chooses the next outcome $y_t$
and the expert advice $\{f_{i,t} \in \mathcal{D} : i \in \mathcal{E}\}$; the
expert advice is revealed to the forecaster;

(2) the forecaster chooses the prediction $p_t \in \mathcal{D}$;

(3) the environment reveals the next outcome $y_t \in \mathcal{Y}$;

(4) the forecaster incurs loss $\ell(p_t, y_t)$ and each
expert i incurs loss $\ell(f_{i,t}, y_t)$.

Fig. 1. The repeated game of prediction with expert advice.

the low-complexity implementation of the general algorithm 
via reduced transition diagrams. Bounds for the performance 
of the algorithm are developed in Section III-D. More explicit
bounds are presented for some important special cases in
Sections III-E and III-F. The results are extended to the related
framework of randomized prediction in Section IV. Some
applications to specific examples are given in Section V.

II. Preliminaries 

In this section we review some basic facts about prediction 
with expert advice, and introduce the tracking problem. 

A. Prediction with expert advice 

Let the decision space $\mathcal{D}$ be a convex subset of a vector
space and let $\mathcal{Y}$ be a set representing the outcome space. Let
$\ell : \mathcal{D} \times \mathcal{Y} \to \mathbb{R}$ be a loss function, assumed to be convex in its first
argument. At each time instant $t = 1, \ldots, n$, the environment
chooses an action $y_t \in \mathcal{Y}$ and each "expert" i from a reference
class $\mathcal{E}$ forms its prediction $f_{i,t} \in \mathcal{D}$. Then the forecaster
chooses an action $p_t \in \mathcal{D}$ (without knowing $y_t$), suffers loss
$\ell(p_t, y_t)$, and the losses $\ell(f_{i,t}, y_t)$, $i \in \mathcal{E}$, are revealed to the
forecaster. (This is known as the full information case and in 
this paper we only consider this model. In other, well-studied, 
variants of the problem, the forecaster only receives limited 
information about the losses.) 

The goal of the forecaster is to minimize its cumulative loss
$\widehat{L}_n = \sum_{t=1}^{n} \ell(p_t, y_t)$, which is equivalent to minimizing its
excess loss $\widehat{L}_n - \min_{i \in \mathcal{E}} L_{i,n}$ relative to the set of experts
$\mathcal{E}$, where $L_{i,n} = \sum_{t=1}^{n} \ell(f_{i,t}, y_t)$ for all $i \in \mathcal{E}$.

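Several methods are known that can compete successfully
with different expert classes $\mathcal{E}$ in the sense that the (worst-case)
cumulative regret, defined as

\[
R_n = \max_{(y_1,\ldots,y_n) \in \mathcal{Y}^n} \Bigl( \widehat{L}_n - \min_{i \in \mathcal{E}} L_{i,n} \Bigr)
    = \max_{(y_1,\ldots,y_n) \in \mathcal{Y}^n} \Bigl( \sum_{t=1}^{n} \ell(p_t, y_t) - \min_{i \in \mathcal{E}} \sum_{t=1}^{n} \ell(f_{i,t}, y_t) \Bigr),
\]

only grows sub-linearly, that is, $\lim_{n\to\infty} R_n/n = 0$. One of
the most popular among these is exponential weighting. When
the expert class $\mathcal{E}$ is finite or countably infinite, this method
assigns, at each time instant t, the nonnegative weight

\[
\pi_{i,t} = \frac{w_i e^{-\eta_t L_{i,t-1}}}{\sum_{j \in \mathcal{E}} w_j e^{-\eta_t L_{j,t-1}}}
\]

to each expert $i \in \mathcal{E}$. Here $L_{i,t-1} = \sum_{s=1}^{t-1} \ell(f_{i,s}, y_s)$ is the
cumulative loss of expert i up to time $t - 1$, $\eta_t > 0$ is called
the learning parameter, and the $w_i \ge 0$ are nonnegative initial
weights with $\sum_{i \in \mathcal{E}} w_i = 1$, so that $\sum_{i \in \mathcal{E}} \pi_{i,t} = 1$ (we define
$L_{i,0} = 0$ for all $i \in \mathcal{E}$, as well as $\widehat{L}_0 = 0$). The decision
chosen by this algorithm is

\[
p_t = \sum_{i \in \mathcal{E}} \pi_{i,t} f_{i,t}, \tag{1}
\]

which is well defined since $\mathcal{D}$ is convex.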

In this paper we concentrate on two special types of loss 
functions: bounded convex and exp-concave. For such loss 
functions the regret of the exponentially weighted average 
forecaster is well understood. For example, assume $\ell$ is convex
in its first argument and takes its values in [0, 1], and the set
of experts is finite with $|\mathcal{E}| = N$. If $\eta_t$ is nonincreasing in t,
then for all n,

\[
\widehat{L}_n \le \min_{i \in \mathcal{E}} \Bigl\{ L_{i,n} + \frac{1}{\eta_n} \ln \frac{1}{w_i} \Bigr\} + \sum_{t=1}^{n} \frac{\eta_t}{8}; \tag{2}
\]

see [29]. By setting the initial weights to $w_i = 1/N$, $i = 1, \ldots, N$,
and with the choice $\eta_t = 2\sqrt{\ln N / t}$, one obtains for
all $n \ge 1$,

\[
R_n \le \sqrt{n \ln N}. \tag{3}
\]
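The forecaster behind bound (3) is short enough to write out. The following is a minimal sketch, assuming uniform initial weights $w_i = 1/N$ (which cancel in the normalization), the time-varying learning parameter $\eta_t = 2\sqrt{\ln N / t}$, and the absolute loss $|p - y|$ on [0, 1] as an example of a bounded convex loss; the function and variable names are illustrative.

```python
import math
import random


def exp_weighted_forecaster(expert_preds, outcomes, loss):
    """Run the exponentially weighted average forecaster.

    expert_preds[t][i] is the advice of expert i in round t+1.
    Returns (forecaster's cumulative loss, best expert's cumulative loss).
    """
    n_experts = len(expert_preds[0])
    cum_loss = [0.0] * n_experts  # L_{i,t-1} for each expert i
    forecaster_loss = 0.0
    for t, (preds, y) in enumerate(zip(expert_preds, outcomes), start=1):
        eta = 2.0 * math.sqrt(math.log(n_experts) / t)
        # Subtracting the minimum loss only rescales all weights; it avoids underflow.
        base = min(cum_loss)
        weights = [math.exp(-eta * (L - base)) for L in cum_loss]
        total = sum(weights)
        p = sum(w * f for w, f in zip(weights, preds)) / total
        forecaster_loss += loss(p, y)
        cum_loss = [L + loss(f, y) for L, f in zip(cum_loss, preds)]
    return forecaster_loss, min(cum_loss)


if __name__ == "__main__":
    random.seed(0)
    n, n_experts = 500, 8
    preds = [[random.random() for _ in range(n_experts)] for _ in range(n)]
    ys = [random.randint(0, 1) for _ in range(n)]
    total, best = exp_weighted_forecaster(preds, ys, lambda p, y: abs(p - y))
    print("regret:", total - best, "bound:", math.sqrt(n * math.log(n_experts)))
```

Because $\eta_t$ changes with t, the weights are recomputed from the cumulative losses in every round rather than updated multiplicatively.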

If, on the other hand, for some $\eta > 0$ the function $F(p) =
e^{-\eta \ell(p, y)}$ is concave for any fixed $y \in \mathcal{Y}$ (such loss functions
are called exp-concave) then, choosing $\eta_t = \eta$ and
$w_i = 1/N$, $i = 1, \ldots, N$, one has for all $n \ge 1$,

\[
R_n \le \frac{\ln N}{\eta}. \tag{4}
\]

We note that the regret bounds in (2)-(4) do not require a fixed
time horizon, that is, they hold simultaneously for all $n \ge 1$.

The family of exp-concave loss functions includes, for 
example, for $p, y \in [0, 1]$, the square loss $\ell(p, y) = (p - y)^2$
with $\eta \le 1/2$, and the relative entropy loss $\ell(p, y) = y \ln \frac{y}{p} +
(1 - y) \ln \frac{1-y}{1-p}$ with $\eta \le 1$. A special case of the latter is
the logarithmic loss defined for $y \in \{0, 1\}$ and $p \in [0, 1]$
by $\ell(p, y) = -\mathbb{I}_{y=1} \ln p - \mathbb{I}_{y=0} \ln(1 - p)$, which plays a
central role in data compression. Here and throughout the
paper $\mathbb{I}_B$ denotes the indicator of event B. We refer to [1]
for discussions of these bounds.
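For the logarithmic loss with $\eta = 1$, exponential weighting with uniform initial weights is the familiar mixture forecaster, and its regret against the best of N experts is at most $\ln N$ regardless of the horizon. A minimal sketch under these assumptions follows (names are illustrative; expert predictions are kept away from 0 and 1 so the loss stays finite).

```python
import math
import random


def log_loss(p: float, y: int) -> float:
    """Logarithmic loss for a binary outcome y and prediction p in (0, 1)."""
    return -math.log(p) if y == 1 else -math.log(1.0 - p)


def mixture_forecaster(expert_preds, outcomes, eta=1.0):
    """Exponential weighting with constant eta and uniform initial weights.

    Returns (forecaster's cumulative log loss, best expert's cumulative log loss).
    """
    n_experts = len(expert_preds[0])
    cum_loss = [0.0] * n_experts
    forecaster_loss = 0.0
    for preds, y in zip(expert_preds, outcomes):
        base = min(cum_loss)  # rescaling for numerical stability only
        weights = [math.exp(-eta * (L - base)) for L in cum_loss]
        total = sum(weights)
        p = sum(w * f for w, f in zip(weights, preds)) / total
        forecaster_loss += log_loss(p, y)
        cum_loss = [L + log_loss(f, y) for L, f in zip(cum_loss, preds)]
    return forecaster_loss, min(cum_loss)


if __name__ == "__main__":
    random.seed(0)
    n, n_experts = 1000, 10
    preds = [[random.uniform(0.05, 0.95) for _ in range(n_experts)] for _ in range(n)]
    ys = [random.randint(0, 1) for _ in range(n)]
    total, best = mixture_forecaster(preds, ys)
    print("regret:", total - best, "bound ln N:", math.log(n_experts))
```

In contrast to the $\sqrt{n \ln N}$ rate for general bounded convex losses, the regret here does not grow with n.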

B. The tracking problem 

In the standard online prediction problem the goal is to 
perform as well as the best expert in a given reference class 
$\mathcal{E}$. In this paper we consider the more ambitious goal of
competing with a sequence of expert predictions that are
allowed to switch between experts. Formally, such a meta
expert is defined as follows. Fix the time horizon $n \ge 1$. A
meta expert that changes base experts at most $C \ge 0$ times can
be described by a vector of experts $a = (i_0, \ldots, i_C) \in \mathcal{E}^{C+1}$
and a "transition path" $T = (t_1, \ldots, t_C; n)$ such that $t_0 :=
1 < t_1 < \ldots < t_C < t_{C+1} := n + 1$. For each $c = 0, \ldots, C$,
the meta expert follows the advice of expert $i_c$ in the time
interval $[t_c, t_{c+1})$. When the time horizon n is clear from the
context, we will omit it from the description of T and simply
write $T = (t_1, \ldots, t_C)$. We note that this representation is not
unique as the definition does not require that base experts $i_c$
and $i_{c+1}$ be different. Any meta expert that can be defined
using a given transition path T is said to follow T.

The total loss of the meta expert indexed by (T, a), accumulated
during n rounds, is

\[
L_n(T, a) = \sum_{c=0}^{C} L_{i_c}(t_c, t_{c+1}),
\]

where $L_i(t_1, t_2) = \sum_{t=t_1}^{t_2 - 1} \ell(f_{i,t}, y_t)$ denotes the loss of expert
$i \in \mathcal{E}$ in the interval $[t_1, t_2)$, $1 \le t_1 < t_2 \le n$. For any
$t \ge 1$, let $\mathcal{T}_t$ denote the set of all transition paths up to time
t represented by vectors $(t_1, \ldots, t_C; t)$ with $1 < t_1 < t_2 <
\ldots < t_C \le t$ and $0 \le C \le t$. For any $T = (t_1, \ldots, t_C) \in
\mathcal{T}_n$ and $t \le n$ define the truncation of T at time t as $T_t =
(t_1, \ldots, t_k; t)$, where k is such that $t_k \le t < t_{k+1}$ (note that
$t \le n$ guarantees that $t_{C+1} = n + 1 > t$, and so $t_{k+1}$ is
well-defined). Furthermore, let $\tau_t(T) = \tau_t(T_t) = t_k$ denote the last
change up to time t, and let $C_t(T) = C(T_t) = k$ denote the
number of switches up to time t. A transition path T with C
switches splits the time interval into $C + 1$ contiguous
segments.

Our goal is to perform nearly as well as the meta experts,
that is, to keep the regret $\widehat{L}_n - L_n(T, a)$ small relative to the
meta experts (T, a) for all outcome sequences $y_1, \ldots, y_n$. It
is clear that this cannot be done uniformly well for all meta
experts; for example, it is obvious that the performance of
a meta expert that is allowed to switch experts at each time
instant cannot be achieved for all outcome sequences. Indeed,
it is known [4], [30] that, for exp-concave loss functions,
the worst-case regret of any prediction algorithm relative to
the best meta expert with at most C switches, selected in
hindsight, is at least of the order of $(C + 1) \log n$, where the
worst-case tracking regret with respect to meta experts with
at most C switches is defined as

\[
\max_{y_1, \ldots, y_n} \Bigl( \widehat{L}_n - \min_{(T, a) : C_n(T) = C} L_n(T, a) \Bigr).
\]

Algorithms achieving optimal regret rates are known under
general conditions: for general convex loss functions and
a finite number of base experts, a tracking regret of order
$(C(T) + 1)\sqrt{n \ln n}$ (or $\sqrt{(C + 1) n \ln n}$ if C is known in
advance) can be achieved [4], [5], [24], while the $O((C +
1) \ln n)$ lower bound is achievable in case of exp-concave loss
functions and a finite number of experts [2]-[4], [6], [21],
or when the base experts form a convex subset of a finite
dimensional linear space [31].

We will also consider the related notion of adaptive regret

\[
R_n^a = \max_{t \le t'} \; \max_{y_t, y_{t+1}, \ldots, y_{t'}} \Bigl( \sum_{\tau=t}^{t'} \ell(p_\tau, y_\tau) - \min_{i \in \mathcal{E}} \sum_{\tau=t}^{t'} \ell(f_{i,\tau}, y_\tau) \Bigr),
\]

introduced in [31] and [11], which is the maximal worst-case
cumulative excess loss over any contiguous time segment
relative to a constant expert. Minimizing the tracking and
the adaptive regret are similar problems. In fact, one can
show that the FLH1 algorithm of [31] developed to minimize
the adaptive regret and a dynamic version of the fixed-share
algorithm of [4] introduced by [6] to minimize the tracking
regret are identical. Furthermore, any algorithm with small
adaptive regret also enjoys small tracking regret, since the
regret, in n time steps, relative to a meta expert that can
switch the base expert C times can be bounded by $(C + 1) R_n^a$.
Although tracking regret bounds do not immediately yield
bounds on the adaptive regret (since the regret on a time
segment may be negative), it is usually straightforward to
modify the proofs for tracking regret to obtain bounds on the
adaptive regret; see, e.g., the proof of Theorem 2.
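The bookkeeping around transition paths in this subsection (truncation, last change point, number of switches, and the loss of a meta expert) can be sketched in a few lines. The representation below is an assumption made for illustration: a transition path is a tuple of change points $t_1 < \ldots < t_C$ with the convention $t_0 = 1$ and $t_{C+1} = n + 1$, and `losses[i][t - 1]` holds the loss of base expert i at time t.

```python
def truncate(path, t):
    """Change points of the truncation T_t: those t_k with t_k <= t."""
    return tuple(tc for tc in path if tc <= t)


def last_change(path, t):
    """tau_t(T): start of the segment containing time t (t_0 := 1)."""
    kept = truncate(path, t)
    return kept[-1] if kept else 1


def num_switches(path, t):
    """C_t(T): number of switches up to and including time t."""
    return len(truncate(path, t))


def meta_expert_loss(path, experts, losses, n):
    """L_n(T, a): sum over segments [t_c, t_{c+1}) of the loss of expert i_c."""
    bounds = (1,) + tuple(path) + (n + 1,)
    total = 0.0
    for c, expert in enumerate(experts):
        for t in range(bounds[c], bounds[c + 1]):
            total += losses[expert][t - 1]
    return total


if __name__ == "__main__":
    path = (3, 6)  # switches at times 3 and 6, horizon n = 8
    losses = [[1.0] * 8, [0.0] * 8]  # expert 0 always loses 1, expert 1 never
    print(truncate(path, 4), last_change(path, 4), num_switches(path, 7))
    print(meta_expert_loss(path, experts=(0, 1, 0), losses=losses, n=8))
```

Here the meta expert follows expert 0 on [1, 3), expert 1 on [3, 6) and expert 0 again on [6, 9).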

III. A REDUCED COMPLEXITY TRACKING ALGORITHM 

A. A general tracking algorithm 

Here we introduce a general tracking method which forms
the basis of our reduced complexity tracking algorithm.
Consider an on-line forecasting algorithm A that chooses an
element of the decision space depending on the past outcomes
and the expert advice according to the protocol described
in Figure 1. Suppose that for all n and possible outcome
sequences of length n, A satisfies a regret bound

\[
R_n \le \rho_{\mathcal{E}}(n) \tag{5}
\]

with respect to the base expert class $\mathcal{E}$, where $\rho_{\mathcal{E}} : [0, \infty) \to
[0, \infty)$ is a nondecreasing and concave function with $\rho_{\mathcal{E}}(0) =
0$. These assumptions on $\rho_{\mathcal{E}}$ are usually satisfied by the known
regret bounds for different algorithms, such as the bounds (3)
and (4) (with defining $\rho_{\mathcal{E}}(0) = 0$ in the latter case). Suppose
$1 \le t_1 < t_2 \le n$ and an instance of A is used for time instants
$t \in [t_1, t_2) := \{t_1, \ldots, t_2 - 1\}$, that is, algorithm A is run on
data obtained in the segment $[t_1, t_2)$. The accumulated loss of
A during this period will be denoted by $L_A(t_1, t_2)$. Then (5)
implies

\[
L_A(t_1, t_2) - \min_{i \in \mathcal{E}} L_i(t_1, t_2) \le \rho_{\mathcal{E}}(t_2 - t_1).
\]

Running algorithm A on a transition path $T =
(t_1, \ldots, t_C; n)$ means that at the beginning of each segment
of T (at time instants $t_c$) we restart A; this algorithm will
be denoted in the sequel by (A, T). Denote the output of
this algorithm at time t by $f_{A,t}(T) = f_{A,t}(\tau_t(T))$. This
notation emphasizes the fact that, since A is restarted at the
beginning of each segment of T, the output of (A, T) at time
t is influenced by T only through $\tau_t(T)$, the beginning of the
segment that includes t. The loss of algorithm (A, T) up to
time n is

\[
L_n(A, T) = \sum_{c=0}^{C} L_A(t_c, t_{c+1}).
\]
Like most tracking algorithms, our algorithm will use weight
functions $w_t : \mathcal{T}_t \to [0, 1]$ satisfying

\[
\sum_{T_t \in \mathcal{T}_t} w_t(T_t) = 1 \quad \text{and} \quad w_t(T_t) = \sum_{T'_{t+1} \in \mathcal{T}_{t+1} :\, (T'_{t+1})_t = T_t} w_{t+1}(T'_{t+1}) \tag{6}
\]

for all $t = 1, 2, \ldots$ and $T_t \in \mathcal{T}_t$. Thus each $w_t$ is a probability
distribution on $\mathcal{T}_t$ such that the family $\{w_t; t = 1, \ldots, n\}$ is
consistent. To simplify the notation, we formally define $T_0$ as
the "empty transition path," $\mathcal{T}_0 := \{T_0\}$, $L_0(A, T_0) := 0$, and
$w_0(T_0) := 1$.

We say that $\hat{T} \in \mathcal{T}_n$ covers $T \in \mathcal{T}_n$ if the change points
of T are also change points of $\hat{T}$. Note that if $\hat{T}$ covers T,
then any meta expert that follows transition path T also follows
transition path $\hat{T}$. We say that $w_n$ covers $\mathcal{T}_n$ if for any $T \in \mathcal{T}_n$
there exists a $\hat{T} \in \mathcal{T}_n$ with $w_n(\hat{T}) > 0$ which covers T.

Now we are ready to define our first master algorithm, given
in Algorithm 1. We note that the consistency of $\{w_t\}$ implies
that, for any time horizon n, Algorithm 1 is equivalent to the
exponentially weighted average forecaster (1) with the set of
experts $\{(A, T) : T \in \mathcal{T}_n, w_n(T) > 0\}$ and initial weights
$w_n(T)$ for (A, T). The performance and the computational
complexity of the algorithm heavily depend on the properties
of $w_t$; in this paper we will concentrate on judicious choices
of $w_t$ that allow efficient computation of the summations in
Algorithm 1 and have good prediction performance.

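Algorithm 1 General tracking algorithm.

Input: prediction algorithm A, weight functions $\{w_t; t =
1, \ldots, n\}$, learning parameters $\eta_t > 0$, $t = 1, \ldots, n$.

For $t = 1, \ldots, n$ predict

\[
p_t = \frac{\sum_{T \in \mathcal{T}_t} w_t(T) e^{-\eta_t L_{t-1}(A, T)} f_{A,t}(\tau_t(T))}{\sum_{T \in \mathcal{T}_t} w_t(T) e^{-\eta_t L_{t-1}(A, T)}}.
\]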

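The next lemma gives an upper bound on the performance
of Algorithm 1.

Lemma 1: Suppose $\eta_{t+1} \le \eta_t$ for all $t = 1, \ldots, n - 1$, the
transition path $T_n$ is covered by $\hat{T}_n = (\hat{t}_1, \ldots, \hat{t}_{C(\hat{T}_n)})$ such
that $w_n(\hat{T}_n) > 0$, and A satisfies the regret bound (5). Assume
that the loss function $\ell$ is convex in its first argument and takes
values in the interval [0, 1]. Then for any meta expert $(T_n, a)$,
the regret of Algorithm 1 is bounded as

\[
\begin{aligned}
\widehat{L}_n - L_n(T_n, a)
&\le \sum_{c=0}^{C(\hat{T}_n)} \rho_{\mathcal{E}}(\hat{t}_{c+1} - \hat{t}_c) + \sum_{t=1}^{n} \frac{\eta_t}{8} + \frac{1}{\eta_n} \ln \frac{1}{w_n(\hat{T}_n)} \\
&\le (C(\hat{T}_n) + 1)\, \rho_{\mathcal{E}}\Bigl( \frac{n}{C(\hat{T}_n) + 1} \Bigr) + \sum_{t=1}^{n} \frac{\eta_t}{8} + \frac{1}{\eta_n} \ln \frac{1}{w_n(\hat{T}_n)}.
\end{aligned} \tag{7}
\]

On the other hand, if $\ell$ is exp-concave for the value of $\eta$ and
Algorithm 1 is used with $\eta_t = \eta$, then

\[
\begin{aligned}
\widehat{L}_n - L_n(T_n, a)
&\le \sum_{c=0}^{C(\hat{T}_n)} \rho_{\mathcal{E}}(\hat{t}_{c+1} - \hat{t}_c) + \frac{1}{\eta} \ln \frac{1}{w_n(\hat{T}_n)} \\
&\le (C(\hat{T}_n) + 1)\, \rho_{\mathcal{E}}\Bigl( \frac{n}{C(\hat{T}_n) + 1} \Bigr) + \frac{1}{\eta} \ln \frac{1}{w_n(\hat{T}_n)}.
\end{aligned} \tag{8}
\]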



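Proof: Let $\hat{a} = (\hat{i}_0, \ldots, \hat{i}_{C(\hat{T}_n)})$ be the expert vector such that
the meta experts $(T_n, a)$ and $(\hat{T}_n, \hat{a})$ perform identically. Then
clearly

\[
\widehat{L}_n - L_n(T_n, a) = \widehat{L}_n - L_n(A, \hat{T}_n) + L_n(A, \hat{T}_n) - L_n(\hat{T}_n, \hat{a}).
\]

Using (5) and the concavity of $\rho_{\mathcal{E}}$, we get

\[
\begin{aligned}
L_n(A, \hat{T}_n) - L_n(\hat{T}_n, \hat{a})
&= \sum_{c=0}^{C(\hat{T}_n)} \bigl( L_A(\hat{t}_c, \hat{t}_{c+1}) - L_{\hat{i}_c}(\hat{t}_c, \hat{t}_{c+1}) \bigr) \\
&\le \sum_{c=0}^{C(\hat{T}_n)} \rho_{\mathcal{E}}(\hat{t}_{c+1} - \hat{t}_c)
\le (C(\hat{T}_n) + 1)\, \rho_{\mathcal{E}}\Bigl( \frac{n}{C(\hat{T}_n) + 1} \Bigr).
\end{aligned} \tag{9}
\]

Assume that the loss function $\ell$ is convex in its first argument
and takes values in the interval [0, 1]. Since Algorithm 1
is equivalent to the exponentially weighted average forecaster
with experts $\{(A, T) : T \in \mathcal{T}_n, w_n(T) > 0\}$ and initial
weights $w_n(T)$, we can apply the bound (2) to obtain

\[
\widehat{L}_n \le L_n(A, \hat{T}_n) + \frac{1}{\eta_n} \ln \frac{1}{w_n(\hat{T}_n)} + \sum_{t=1}^{n} \frac{\eta_t}{8}.
\]

Combining this with (9) proves (7).

Now assume $\ell$ is exp-concave. Then by [4, Lemma 1],

\[
\widehat{L}_n - L_n(A, \hat{T}_n) \le \frac{1}{\eta} \ln \frac{1}{w_n(\hat{T}_n)}. \tag{10}
\]

This, together with (9), implies (8).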
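The last inequality in (9) is an application of Jensen's inequality to the concave function $\rho_{\mathcal{E}}$. A short worked version is given below; the segment lengths $s_c = \hat{t}_{c+1} - \hat{t}_c$ and the abbreviation $C = C(\hat{T}_n)$ are shorthand introduced here only for this illustration.

```latex
\[
\sum_{c=0}^{C} \rho_{\mathcal{E}}(s_c)
  = (C+1) \cdot \frac{1}{C+1} \sum_{c=0}^{C} \rho_{\mathcal{E}}(s_c)
  \le (C+1)\, \rho_{\mathcal{E}}\Bigl( \frac{1}{C+1} \sum_{c=0}^{C} s_c \Bigr)
  = (C+1)\, \rho_{\mathcal{E}}\Bigl( \frac{n}{C+1} \Bigr),
\]
% The segment lengths telescope: \sum_c s_c = \hat{t}_{C+1} - \hat{t}_0 = (n+1) - 1 = n.
% Example: for \rho_{\mathcal{E}}(s) = \sqrt{s \ln N}, as in (3), the right-hand side equals
\[
(C+1) \sqrt{\frac{n \ln N}{C+1}} = \sqrt{(C+1)\, n \ln N}.
\]
```

For a constant bound such as $\rho_{\mathcal{E}}(s) = \ln N / \eta$ from (4), the same step simply gives $(C + 1) \ln N / \eta$.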



B. The weight function 

One may interpret the weight function $\{w_t\}$ as the conditional
probability that a new segment is started, given
the beginning of the current segment and the current time
instant. In this case one may define $\{w_t\}$ in terms of a
time-inhomogeneous Markov chain $\{U_t; t = 1, 2, \ldots\}$ whose state
space at time t is $\{1, \ldots, t\}$. Starting from state $U_1 = 1$, at
any time instant t, the Markov chain either stays where it was
at time $t - 1$ or switches to state t. The distribution of $\{U_t\}$
is uniquely determined by prescribing $\mathbb{P}(U_1 = 1) = 1$ and for
$1 \le t' < t$,

\[
\mathbb{P}(U_t = t \mid U_{t-1} = t') = 1 - \mathbb{P}(U_t = t' \mid U_{t-1} = t') = p(t|t'), \tag{11}
\]

where the so-called switch probabilities $p(t|t')$ need only
satisfy $p(t|t') \in [0, 1]$ for all $1 \le t' < t$. A realization
of this Markov chain uniquely determines a transition path:
$T_t(u_1, \ldots, u_t) = (t_1, \ldots, t_C) \in \mathcal{T}_t$ if and only if $u_{k-1} \neq u_k$
for $k \in \{t_1, \ldots, t_C\}$, and $u_{k-1} = u_k$ for $k \notin \{t_1, \ldots, t_C\}$,
$2 \le k \le t$. Inverting this correspondence, any $T \in \mathcal{T}_t$ uniquely
determines a realization $(u_1, \ldots, u_t)$. Now the weight function
is given for all $t \ge 1$ and $T \in \mathcal{T}_t$ by

\[
w_t(T) = \mathbb{P}(U_1 = u_1, \ldots, U_t = u_t), \tag{12}
\]

where $(u_1, \ldots, u_t)$ is such that $T = T_t(u_1, \ldots, u_t)$. It is easy
to check that $\{w_t\}$ satisfies the two conditions in (6). Clearly,
the switch probabilities $p(t|t')$ uniquely determine $\{w_t\}$. The
above structural assumption on $\{w_t\}$, originally introduced in
[2], greatly reduces the possible ways of weighting different
transition paths, allowing implementation of Algorithm 1 with
complexity at most $O(n^2)$ (if one step of A can be implemented
in constant time), instead of the potentially exponential
time complexity of the algorithm in the naive implementation;
see Section III-C.
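The Markov-chain construction can be made concrete by computing $w_t(T)$ as a product of switch and no-switch probabilities along the path, and then checking numerically that the resulting weights form a probability distribution and are consistent across horizons. This is a sketch only: paths are represented as tuples of change points in $\{2, \ldots, t\}$, and the switch probability used in the demo is an arbitrary example, not a recommendation.

```python
from itertools import combinations


def path_weight(path, t, p):
    """w_t(T): probability that the chain U_1..U_t realizes the change points in `path`.

    p(s, last) is the probability of switching to state s at time s,
    given that the current segment started at `last`.
    """
    weight, last = 1.0, 1
    change_points = set(path)
    for s in range(2, t + 1):
        if s in change_points:
            weight *= p(s, last)
            last = s
        else:
            weight *= 1.0 - p(s, last)
    return weight


def all_paths(t):
    """All transition paths up to time t: subsets of {2, ..., t} in increasing order."""
    for c in range(t):
        yield from combinations(range(2, t + 1), c)


if __name__ == "__main__":
    def example_p(s, last):
        # Hypothetical switch probability that decays with the segment age.
        return 0.5 / (s - last + 1)

    print(sum(path_weight(T, 6, example_p) for T in all_paths(6)))  # total mass
    T = (3, 5)
    # Consistency: extending T to time 7 with or without a switch at 7.
    print(path_weight(T, 6, example_p),
          path_weight(T, 7, example_p) + path_weight(T + (7,), 7, example_p))
```

Each path has $2^{t-1}$ siblings at horizon t, which is exactly the enumeration that the recursive implementation of Section III-C avoids.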

Some examples that have been proposed for this construction
(given in terms of the switch probabilities) include

• $w^{HW}$, used in [4], is defined by $p^{HW}(t|t') = \alpha$ for some
$0 < \alpha < 1$.

. u;^'5,usedin|l6l, i], |[IT], is defined by = 



• $w^{KT}$, used in [2], is defined by

\[
p^{KT}(t|t') = \frac{1/2}{t - t' + 1}, \tag{13}
\]

which is the Krichevsky-Trofimov estimate [13] for binary
sequences of the probability that after observing
an all zero sequence of length $t - t'$, the next symbol
will be a one. Using standard bounds on the Krichevsky-Trofimov
estimate, it is easy to show (see, e.g., [2]) that
for any $T \in \mathcal{T}_n$ with segment lengths $s_0, s_1, \ldots, s_C \ge 1$
(satisfying $\sum_{c=0}^{C} s_c = n$),

\[
\ln \frac{1}{w_n^{KT}(T)} \le \frac{1}{2} \sum_{c=0}^{C} \ln s_c + (C + 1) \ln 2. \tag{14}
\]

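• $w^{\mathcal{L}_1}$ and $w^{\mathcal{L}_2}$, used in [3] (similar weight functions were
considered in [5]), are defined as follows: for a given
$0 < \epsilon < 1$,$^2$ let $\pi_j = 1/j^{1+\epsilon}$ and $Z_t = \sum_{j=1}^{t} \pi_j$ (with
$Z_0 = 0$ and $Z_\infty = \sum_{j=1}^{\infty} \pi_j$). Then $w^{\mathcal{L}_1}$ and $w^{\mathcal{L}_2}$ are
defined, respectively, by

\[
p^{\mathcal{L}_1}(t|t') = \frac{\pi_{t-1}}{Z_\infty - Z_{t-2}}
\quad \text{and} \quad
p^{\mathcal{L}_2}(t|t') = \frac{\pi_{t-t'}}{Z_\infty - Z_{t-t'-1}}.
\]

Here we consider the weights $w^{\mathcal{L}_1}$. It is shown in [3, proof
of Eq. (39)] that for any $T \in \mathcal{T}_n$,

\[
\ln \frac{1}{w_n^{\mathcal{L}_1}(T)} \le (C_n(T) + \epsilon) \ln n + \ln(1 + \epsilon) - C_n(T) \ln \epsilon. \tag{15}
\]

$^2$ The upper bound $\epsilon < 1$ is missing from [3], although it is implicitly
required in the proof.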



C. A low-complexity algorithm 

Efficient implementation of Algorithm 1 hinges on three
factors: (i) Algorithm A can be efficiently implemented; (ii)
the exponential weighting step can be efficiently implemented;
which is facilitated by (iii) the availability of the losses
$L_A(t', t)$ at each time instant t for all $1 \le t' < t$ in the sense
that these losses can be computed efficiently. In what follows
we assume that (i) and (iii) hold and develop a method for (ii)
via constructing a new weight function $\{\hat{w}_t\}$ that significantly
reduces the complexity of implementing Algorithm 1.

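First, we observe that the predictor $p_t$ of Algorithm 1 can
be rewritten as

\[
p_t = \frac{\sum_{t'=1}^{t} v_t(t') f_{A,t}(t')}{\sum_{t'=1}^{t} v_t(t')}, \tag{16}
\]

where the weights $v_t$ are given by

\[
v_t(t') = \sum_{T \in \mathcal{T}_t :\, \tau_t(T) = t'} w_t(T) e^{-\eta_t L_{t-1}(A, T)}. \tag{17}
\]

Note that $v_t(t')$ gives the weighted sum of the exponential
weights of all transition paths with the last switch at $t'$.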

If the learning parameters $\eta_t$ are constant during the time
horizon, the above means that Algorithm 1 can be implemented
efficiently by keeping a weight $v_t(t')$ at each time
instant t for every possible starting point of a segment $t' =
1, \ldots, t$. Indeed, if $\eta_t = \eta$ for all t, then (11) and (12)
imply that each $v_t(t')$ can be computed recursively in $O(t)$
time from the $v_{t-1}$ (setting $v_1(1) := 1$ at the beginning) using
the switch probabilities defining $w_t$ as follows:

\[
v_t(t') =
\begin{cases}
v_{t-1}(t') \bigl( 1 - p(t|t') \bigr) e^{-\eta \ell(f_{A,t-1}(t'), y_{t-1})} & \text{for } t' = 1, \ldots, t - 1, \\[4pt]
\sum_{t''=1}^{t-1} v_{t-1}(t'') \, p(t|t'') \, e^{-\eta \ell(f_{A,t-1}(t''), y_{t-1})} & \text{for } t' = t.
\end{cases} \tag{18}
\]

Using this recursion, the overall complexity of computing the
weights during n rounds is $O(n^2)$. Furthermore, (16) means
that one needs to start an instance of A for each possible
starting point of a segment. If the complexity of running
algorithm A for n time steps is $O(n)$ (i.e., computing A
at each time instant has complexity $O(1)$), then the overall
complexity of our algorithm becomes $O(n^2)$.
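The recursion (18) can be checked against the defining sum (17) on a small horizon. The sketch below abstracts the base algorithm away: `inst_loss(start, s)` stands for the loss suffered at time s by the instance of A that was started at time `start`, and both it and the switch probability `p` are hypothetical callables supplied by the caller.

```python
import math
from itertools import combinations


def recursive_weights(n, p, inst_loss, eta):
    """Return [v_1, ..., v_n]; each v_t is a dict {t': v_t(t')} computed via (18)."""
    v = {1: 1.0}
    history = [v]
    for t in range(2, n + 1):
        # Account for the loss each running instance suffered at time t - 1.
        decayed = {tp: val * math.exp(-eta * inst_loss(tp, t - 1)) for tp, val in v.items()}
        new_v = {tp: val * (1.0 - p(t, tp)) for tp, val in decayed.items()}  # no switch
        new_v[t] = sum(val * p(t, tp) for tp, val in decayed.items())        # switch at t
        v = new_v
        history.append(v)
    return history


def brute_force_weight(t, t_last, p, inst_loss, eta):
    """v_t(t_last) from (17), enumerating every path whose last switch is t_last."""
    total = 0.0
    for c in range(t):
        for path in combinations(range(2, t + 1), c):
            if (path[-1] if path else 1) != t_last:
                continue
            weight, current, loss = 1.0, 1, 0.0
            for s in range(1, t + 1):
                if s >= 2:
                    if s in path:
                        weight *= p(s, current)
                        current = s
                    else:
                        weight *= 1.0 - p(s, current)
                if s <= t - 1:  # L_{t-1}(A, T) only covers times 1, ..., t-1
                    loss += inst_loss(current, s)
            total += weight * math.exp(-eta * loss)
    return total


if __name__ == "__main__":
    def p(t, tp):
        return 0.5 / (t - tp + 1)          # example switch probability

    def inst_loss(start, s):
        return ((7 * start + 3 * s) % 5) / 5.0  # arbitrary losses in [0, 1)

    hist = recursive_weights(6, p, inst_loss, eta=0.7)
    for tp, val in sorted(hist[-1].items()):
        print(tp, val, brute_force_weight(6, tp, p, inst_loss, 0.7))
```

The recursive version touches t weights in round t, for $O(n^2)$ work overall, whereas the brute-force sum visits all $2^{t-1}$ paths.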

It is clearly not a desirable feature that the amount of
computation per time round grows (linearly) with the horizon
n. While we don't know how to completely eliminate this
ever-growing computational demand, we are able to moderate
this growth significantly. To this end, we modify the weight
functions in such a way that at any time instant t we allow
at most $O(g \ln t)$ actual segments with positive probability
(i.e., segments containing t that belong to sample paths with
positive weights), where $g > 0$ is a parameter of the algorithm
(note that g may depend on, e.g., the time horizon n).
Specifically, we will construct a new weight function $\hat{w}_t$ such
that

\[
\bigl| \{ \tau_t(T) : \hat{w}_t(T_t) > 0, \, T \in \mathcal{T}_n \} \bigr| \le \Bigl\lceil \frac{g}{2} \Bigr\rceil \bigl( \lfloor \log t \rfloor + 1 \bigr),
\]

where log denotes base-2 logarithm. By doing so, the time and
space complexity of the algorithm becomes $O(g \ln n)$ times
more than that of algorithm A, as we need to run $O(g \ln n)$
instances of A in parallel and the number of non-zero terms
in (16) and (18) is also $O(g \ln n)$ (here we exclude the trivial
case where A has zero space complexity; also note that the
time complexity of A is at least linear in n since it has to
make a prediction at each time instant). Thus, in case of a
linear-time-complexity algorithm A, the overall complexity of
Algorithm 1 becomes $O(g n \ln n)$.

In order to construct the new weight function, at each time
instant t we force some segments to end. Then any path that
contains such a segment will start a new segment at time t
(and hence the corresponding vector of transitions contains
t). Specifically, any time instant s can be uniquely written as
$o2^u$ with o being a positive odd number and u a nonnegative
integer (i.e., $2^u$ is the largest power of 2 that divides s). We
specify that a segment starting at s can "live" for at most $g2^u$
time instants, where $g > 0$ is a parameter of the algorithm,
so that at time $s + g2^u$ we force a switch in the path. More
precisely, given any switch probability $p(t|t')$ for all $t' < t$,
we define a new switch probability

$$ \hat{p}(t|t') = 1 - h_t(t')\bigl(1 - p(t|t')\bigr) \tag{19} $$

where

$$ h_t(s) = \begin{cases} 1 & \text{if } s \le t < s + g2^u, \\ 0 & \text{otherwise.} \end{cases} $$

Thus $h_t(s) = 1$ if and only if a segment started at s is still
valid at time t. In terms of the Markov chain $\{U_t\}$ introduced
in (fTTT i, the new switch probabilities in definition (19) mean
that if the chain is in state $t'$ at time $t-1$ such that $h_t(t') = 1$,
then the chain switches to state t with the original switch
probability $p(t|t')$ and remains at state $t'$ with probability
$1 - p(t|t')$; but if $h_t(t') = 0$, then the chain switches to state t
with probability 1. In this way, given the switch probabilities
$p(t|t')$ and the associated weight function $\{w_t\}$, we can define
a new weight function $\{\hat{w}_t\}$ via the new switch probabilities
$\hat{p}(t|t')$ and the procedure described in Section III-B. Note that
the definition of $\{\hat{w}_t\}$ implies that for a transition path $T \in \mathcal{T}_t$
either

$$ \hat{w}_t(T) = 0 \quad \text{or} \quad \hat{w}_t(T) \ge w_t(T). \tag{20} $$
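
The pruning rule can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, g is assumed to be a positive integer, and the left endpoint of the lifetime window is assumed to be closed (a segment started at s counts as valid at time s itself).

```python
def two_adic_exponent(s):
    """Return u such that 2**u is the largest power of 2 dividing s >= 1."""
    return (s & -s).bit_length() - 1

def h(t, s, g):
    """1 if a segment started at s is still valid at time t, else 0."""
    return 1 if s <= t < s + g * 2 ** two_adic_exponent(s) else 0

def pruned_switch_prob(p, t, t_prime, g):
    """New switch probability of (19): 1 - h_t(t') * (1 - p(t | t')).

    p is a callable (t, t') -> original switch probability.
    """
    return 1.0 - h(t, t_prime, g) * (1.0 - p(t, t_prime))
```

For example, with g = 1 a segment started at s = 12 (largest power-of-2 divisor 4) is valid at times 12 through 15, and a switch is forced at time 16.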

The above procedure is a common generalization of several
algorithms previously reported in the literature for pruning the
transition paths. Specifically, g = 1 yields the procedure of [9],
g = 3 yields our previous procedure [10], g = 4 yields the
method of [11], while g = n yields the original weighting
$\{w_t\}$ without pruning. We will show that the time complexity
of the method with a constant g (i.e., when g is independent
of the time horizon n) is, in each time instant, at most
$O(\ln n)$ times the complexity of one step of A, while the time
complexity of the algorithm without pruning is $O(n)$ times
the complexity of A. Complexities that interpolate between
these two extremes can be achieved by setting $g = o(n)$
appropriately.

We say that a segment at time instant t is alive if it contains
t and is valid if there is a path $T_t$ with $\hat{w}_t(T_t) > 0$ that
contains exactly that segment. In what follows we assume that
the original switch probabilities $p(t|t')$ associated with the $w_t$
satisfy $p(t|t') \in (0, 1)$ for all $1 \le t' < t$. (Note that the weight
function examples introduced in Section III-B all satisfy this
condition.) The condition implies that $w_t(T_t) > 0$ for all
$T_t \in \mathcal{T}_t$. Furthermore, if $T_t = (t_1, \ldots, t_C) \in \mathcal{T}_t$ satisfies
$t_{i+1} - t_i \le g2^{u_{t_i}}$, $i = 1, \ldots, C$, where $2^{u_{t_i}}$ is the largest power of 2 divisor
of $t_i$, then from (19) we get $\hat{w}_t(T_t) > 0$.

The next lemma gives a characterization of when $h_t(s) = 1$
and, as a consequence, bounds the number of valid segments
that are alive at t.

Lemma 2: Let $t = \sum_{i=1}^{m} 2^{u_i}$ be the binary form of t with
$0 \le u_1 < u_2 < \cdots < u_m$, $s_k = \sum_{i=k}^{m} 2^{u_i}$, $u_0 = -1$.
Then $h_t(s) = 1$ if and only if $s = s_k - j2^u$ for some
$u_{k-1} < u \le u_k$ and $j \in \{0, \ldots, g-1\}$ such that $2^u$ is the largest
2-power divisor of s; in particular, j is even if $u = u_k$ for some
$k \in \{1, \ldots, m\}$, and odd otherwise. As a consequence, at any
time instant t there are at most $\lceil g/2 \rceil (\lfloor \log t \rfloor + 1)$ segments
that are valid and alive.

Proof: It is clear that for any s satisfying the conditions
of the lemma, $h_t(s) = 1$ since $s + g2^u = s_k - j2^u + g2^u \ge s_k + 2^u > t \ge s$.
To prove the other direction, consider an s, assume $h_t(s) = 1$
and denote the largest 2-power divisor of s by $2^u$. By definition,
$h_t(s) = 1$ if and only if
$s + j2^u \le t < s + (j+1)2^u$ for some $j \in \{0, \ldots, g-1\}$.
After reordering we obtain

$$ t - (j+1)2^u < s \le t - j2^u. \tag{21} $$

Let $k \in \{1, \ldots, m\}$ be the unique index such that $u_{k-1} < u \le u_k$
(note that $u \le u_m$ always holds). Then $2^u$ divides $s_k$, and
$s_k \le t < s_k + 2^u$. Combining this inequality with (21) gives
$s_k - (j+1)2^u < s < s_k - (j-1)2^u$. Taking into account that
both s and $s_k$ are divisible by $2^u$, we obtain $s = s_k - j2^u$.
Furthermore, since $2^u$ is the largest 2-power divisor of s, j
must be even when $u = u_k$ for some $k \in \{1, \ldots, m\}$, and
odd otherwise.

Finally, for any $u \in \{0, 1, \ldots, u_m\}$, the set

$$ \{ s = s_k - j2^u : u_{k-1} < u \le u_k,\ j = 0, \ldots, g-1,\ 2^u \text{ is the largest 2-power divisor of } s \} $$

has at most $\lceil g/2 \rceil$ elements. Since $u_m = \lfloor \log t \rfloor$, the proof is
complete. ■
Note that for g = 1 the valid segments that are alive at t
start exactly at $s_k$, $k = 1, \ldots, m$, and so the number of valid
segments at time t is exactly the number of 1's in the binary
form of t [9]. The above lemma implies that Algorithm 1 can
be implemented efficiently with the proposed weight function
$\{\hat{w}_t\}$.
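
The counting statement of Lemma 2 is easy to check by brute force for small t. The sketch below is only a numerical sanity check under the assumptions that g is a positive integer and that a segment started at s is valid at time t exactly when $s \le t < s + g2^u$; the function names are hypothetical.

```python
def alive_starts(t, g):
    """Starting points s <= t of segments that are still valid at time t."""
    starts = []
    for s in range(1, t + 1):
        u = (s & -s).bit_length() - 1          # 2**u is the largest power of 2 dividing s
        if s <= t < s + g * 2 ** u:
            starts.append(s)
    return starts

def lemma2_bound(t, g):
    """ceil(g/2) * (floor(log2 t) + 1); t.bit_length() equals floor(log2 t) + 1."""
    return -(-g // 2) * t.bit_length()
```

For g = 1 the list returned by `alive_starts(t, 1)` has as many elements as there are 1's in the binary form of t, as noted above.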

Theorem 1: Assume Algorithm 1 is run with weight function
$\{\hat{w}_t\}$ derived using any $g > 0$ from any weight function
$\{w_t\}$ defined as in Section III-B. If $\eta_t = \eta$ for some $\eta > 0$
and all $t = 1, \ldots, n$, then the time and space complexity of
Algorithm 1 is $O(g \ln n)$ times the time and space complexity
of A, respectively.

Proof: The result follows since Lemma 2 implies that the
number of non-zero terms in dTSI l and (16) is always $O(g \ln t)$. ■



D. Regret bounds 

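To bound the regret, we need the following lemma which
shows that any segment $[t, t')$ can be covered with at most
$\left\lceil \frac{\log(t'-t)}{\lfloor \log(g+1) \rfloor} \right\rceil + 1$ valid segments.

Lemma 3: For any $T \in \mathcal{T}_n$, there exists $\hat{T} \in \mathcal{T}_n$ such that
for any segment $[t, t')$ of T with $1 \le t < t' \le n + 1$,
(i) $\hat{w}_n(\hat{T}) > 0$, t and $t'$ are switch points of $\hat{T}$ (if $t' = n + 1$,
it is considered as a switch point), and $\hat{T}$ contains at most

$$ l = \left\lceil \frac{\log(t'-t)}{\lfloor \log(g+1) \rfloor} \right\rceil + 1 $$

segments in $[t, t')$;

(ii) if the switch points of $\hat{T}$ in $[t, t')$ are $t_1 := t < t_2 < \cdots < t_{l'} < t_{l'+1} := t'$,
then $l' \le l$, and for any
nondecreasing function $f : [0, \infty) \to [0, \infty)$,

$$ \sum_{i=1}^{l'} f(t_{i+1} - t_i) \le \sum_{i=0}^{l'-2} f\!\left( \frac{t'-t}{2^{i \lfloor \log(g+1) \rfloor}} \right) + f(t'-t) \tag{22} $$

$$ \le \int_0^{\frac{\log(t'-t)}{\lfloor \log(g+1) \rfloor}} f\!\left( \frac{t'-t}{2^{x \lfloor \log(g+1) \rfloor}} \right) dx + 2f(t'-t) \tag{23} $$

where the second summation in (22) is empty if $l' = 1$.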
Remark: Note that it is possible to obtain for I the less 
compact and harder-to-handle formula 



I = 



log 



t' -t+- 



Llog(3+l)J _1 



2lloE(g + l)J 



2Llog(s + l)J _1 



Llog(5 + 1)J 



by taking into account that the last segment \ti,ti^i) in the 
construction of the proof can always be defined to be of length 
at least [log(g+l)J 2"' . Furthermore, for 5 = 1 it follows from 
[[91 that the last term is not needed in (l22T i. and hence the latter 
bound can be strengthened to 



r Llog(i'-i)J 



(24) 



j=0 



Proof: Clearly, it is enough to define $\hat{T}$ independently
in each segment $[t, t')$ of T. We construct the switch points
$t_1 < t_2 < \cdots < t_{l'}$ of $\hat{T}$ in this interval, for some $l' \le l$, and
an auxiliary variable $t_{l'+1} \ge t'$ one by one such that $t_1 = t$,
$t_{l'} < t'$ and, defining $2^{u_j}$ as the largest 2-power divisor of $t_j$,

$$ u_{j+1} - u_j \ge \lfloor \log(g+1) \rfloor \tag{25} $$

for $j = 1, \ldots, l' - 1$. Assume that we have already defined
$t_1, \ldots, t_i$ satisfying (25) for $j = 1, \ldots, i - 1$. Then a segment
starting at $t_i$ may be alive with positive probability at any
time instant in $[t_i, t_i + g2^{u_i})$. Define $u_{i+1}$ to be the largest
nonnegative integer such that there is an $s \in [t_i + 1, t_i + g2^{u_i}]$
such that $2^{u_{i+1}}$ divides s. Then $s2^{-u_i}$ belongs to the set
$S_i = \{t_i 2^{-u_i}, t_i 2^{-u_i} + 1, t_i 2^{-u_i} + 2, \ldots, t_i 2^{-u_i} + g\}$ (although,
clearly, $s2^{-u_i} \ne t_i 2^{-u_i}$). Since $S_i$ is a set of $g + 1$ consecutive
integers, it has an element a that is divisible by $2^{\lfloor \log(g+1) \rfloor}$,
and this element is not the odd number $t_i 2^{-u_i}$. Thus
$2^{u_i} a \in [t_i + 1, t_i + g2^{u_i}]$ and since $2^{u_i} a$ is divisible by $2^{u_i + \lfloor \log(g+1) \rfloor}$,
the maximal property of the 2-power divisor $2^{u_{i+1}}$ of s implies
that $u_{i+1} \ge u_i + \lfloor \log(g+1) \rfloor$. Therefore, defining $t_{i+1} = s$,
its largest 2-power divisor is $2^{u_{i+1}}$, proving (25) for $j = i$
(note that it is easy to show that the choice of s, and hence
that of $t_{i+1}$, is unique).
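
The greedy construction in this part of the proof can be sketched as follows. This is an illustrative sketch, assuming an integer $g \ge 1$; the function names are hypothetical. `next_switch` picks, among the time instants at which a segment started at $t_i$ may end, the one with the largest power-of-2 divisor, and `max_segments` evaluates $l = \lceil \log(t'-t)/\lfloor \log(g+1) \rfloor \rceil + 1$ in integer arithmetic.

```python
def next_switch(t_i, g):
    """The s in [t_i + 1, t_i + g * 2**u_i] with the largest power-of-2 divisor."""
    u_i = (t_i & -t_i).bit_length() - 1
    candidates = range(t_i + 1, t_i + g * 2 ** u_i + 1)
    return max(candidates, key=lambda s: s & -s)   # s & -s is the largest 2-power divisor

def cover(t, t_end, g):
    """Switch points t_1 = t < t_2 < ... of the covering path inside [t, t_end)."""
    points = [t]
    while True:
        nxt = next_switch(points[-1], g)
        if nxt >= t_end:
            return points
        points.append(nxt)

def max_segments(length, g):
    """Smallest l with 2**((l - 1) * floor(log2(g + 1))) >= length."""
    G = (g + 1).bit_length() - 1                   # floor(log2(g + 1))
    l = 1
    while 2 ** ((l - 1) * G) < length:
        l += 1
    return l
```

For instance, `cover(1, 8, 1)` gives the switch points 1, 2, 4, and `cover(1, 16, 3)` gives 1, 4, so the exponents of the 2-power divisors grow by at least $\lfloor \log(g+1) \rfloor$ at each step.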

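Now let $l'$ be the smallest integer such that $t_{l'+1} \ge t'$. To
prove part (i) of the lemma, it is sufficient to show that $l' \le l$
and the segments $[t_1, t_2), [t_2, t_3), \ldots, [t_{l'-1}, t_{l'}), [t_{l'}, t')$ cover
$[t, t')$, which is clearly true if $t_{l+1} \ge t'$. From (25) and the
fact that $t_{i+1} - t_i$ is divisible by $2^{u_i}$, we have

$$
\begin{aligned}
t_{l+1} &= t + \sum_{i=1}^{l} (t_{i+1} - t_i) \ge t + \sum_{i=1}^{l} 2^{u_i} \\
&\ge t + \sum_{i=1}^{l} 2^{u_1 + \sum_{j=2}^{i} \lfloor \log(g+1) \rfloor}
= t + \sum_{i=0}^{l-1} 2^{u_1 + i \lfloor \log(g+1) \rfloor} \\
&\ge t + \frac{2^{l \lfloor \log(g+1) \rfloor} - 1}{2^{\lfloor \log(g+1) \rfloor} - 1}
\ge t + 2^{(l-1) \lfloor \log(g+1) \rfloor} \ge t'
\end{aligned}
$$

where in the last step we used the definition of l. This finishes
the proof of (i).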

To prove (ii), we first show that the transition path T 
constructed above satisfies (|22] |. where, with a slight abuse 
of notation, we redefine i/'+i from part (i) to be t'. First 
notice that since t + g2^''-^ < + g2"''-^ < t\ we have 
ui'-i < log -^-^J . Repeated application of ( l25T l implies, for 
any i — 1, — 1, 



1 

Ui < 



log 



t'-t 



and 



ti+^ — ti 



-(/'-l-z) Llog(ff+l)J 



< g2^'°'^^-l^^''"^"'^Li°g(s+i)J 

< g2'' 



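Using the crude estimate $t' - t_{l'} \le t' - t$ finishes the proof
of (22). The last inequality (23) holds trivially for $l' = 1$, and
holds for $l' \ge 2$ since

$$
\begin{aligned}
\sum_{i=0}^{l'-2} f\!\left( \frac{t'-t}{2^{i \lfloor \log(g+1) \rfloor}} \right)
&= f(t'-t) + \sum_{i=1}^{l'-2} f\!\left( \frac{t'-t}{2^{i \lfloor \log(g+1) \rfloor}} \right) \\
&\le f(t'-t) + \int_0^{\left\lceil \frac{\log(t'-t)}{\lfloor \log(g+1) \rfloor} \right\rceil - 1} f\!\left( \frac{t'-t}{2^{x \lfloor \log(g+1) \rfloor}} \right) dx \\
&\le f(t'-t) + \int_0^{\frac{\log(t'-t)}{\lfloor \log(g+1) \rfloor}} f\!\left( \frac{t'-t}{2^{x \lfloor \log(g+1) \rfloor}} \right) dx.
\end{aligned}
$$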
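Taking into account that $C(T_n) \le C(\hat{T}_n)$ if $\hat{T}_n$ covers $T_n$,
Lemma 3 trivially implies the following bounds.

Lemma 4: For any $T_n \in \mathcal{T}_n$ there exists a $\hat{T}_n \in \mathcal{T}_n$ with
$\hat{w}_n(\hat{T}_n) > 0$ such that $\hat{T}_n$ covers $T_n$ and

$$ C(T_n) \le C(\hat{T}_n) \le (C(T_n) + 1) L_{C(T_n),n} - 1 \tag{26} $$

where

$$
L_{C,n} =
\begin{cases}
\left\lceil \dfrac{\log n}{\lfloor \log(g+1) \rfloor} \right\rceil + 1 & \text{if } C = 0, \\[8pt]
\dfrac{\log \frac{n}{C+1}}{\lfloor \log(g+1) \rfloor} + 2 & \text{if } C \ge 1.
\end{cases}
\tag{27}
$$

Proof: The lower bound is trivial, and the upper bound
directly follows from Lemma 3 for $C(T_n) = 0$. For $C(T_n) \ge 1$
the upper bounds follow since on each segment of $T_n$ we
can define $\hat{T}_n$ as in the proof of Lemma 3. Hence, if
$T_n = (t_1, \ldots, t_C; n)$, then

$$
\begin{aligned}
C(\hat{T}_n) + 1 &\le \sum_{i=1}^{C+1} \left( \left\lceil \frac{\log(t_i - t_{i-1})}{\lfloor \log(g+1) \rfloor} \right\rceil + 1 \right)
\le \sum_{i=1}^{C+1} \left( \frac{\log(t_i - t_{i-1})}{\lfloor \log(g+1) \rfloor} + 2 \right) \\
&\le (C+1) \left( \frac{\log \frac{n}{C+1}}{\lfloor \log(g+1) \rfloor} + 2 \right)
\end{aligned}
$$

where in the last step we used Jensen's inequality and the
concavity of the logarithm. ■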

We now apply the preceding construction and results to the
weight function $\{w_t\} = \{w_t^{\mathcal{L}_1}\}$ to obtain our main theorem:

Theorem 2: Assume Algorithm 1 is run with $g > 0$ and
weight function $\{\hat{w}_t^{\mathcal{L}_1}\}$ for some $0 < \epsilon < 1$ (derived from
$\{w_t^{\mathcal{L}_1}\}$), based on a prediction algorithm that satisfies (5) for
some $\rho_{\mathcal{E}}$. Let $L_{C,n}$ be defined by (27). If $\ell$ is convex in its first
argument and takes values in the interval [0, 1] and $\eta_{t+1} \le \eta_t$
for $t = 1, \ldots, n - 1$, then for all $n \ge 1$ and any $T \in \mathcal{T}_n$, the
tracking regret satisfies

$$
\hat{L}_n - L_n(T, a) \le L_{C(T),n}(C(T)+1)\, \rho_{\mathcal{E}}\!\left( \frac{n}{L_{C(T),n}(C(T)+1)} \right)
+ \sum_{t=1}^{n} \frac{\eta_t}{8}
+ \frac{r_n\bigl( L_{C(T),n}(C(T)+1) - 1 \bigr)}{\eta_n}
\tag{28}
$$

where the function $r_n(C)$ is defined as

$$ r_n(C) = (C + \epsilon) \ln n + \ln(1 + \epsilon) - C \ln \epsilon. $$

Furthermore, for e < 1/2 and n > 5, the adaptive regret of 
the algorithm satisfies 



\^Vt r'^ {Lo,n - 1) 



t=l 



(29) 



where the function r^ (C) is defined as 

r;(C) = (C + 1) In n - (C + 1) Ine. 
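
For a quick numerical feel of these penalty terms, the two functions can be evaluated directly. The sketch below simply transcribes the definitions of $r_n(C)$ and $r'_n(C)$ given in the theorem; the Python names `r_n` and `r_n_prime` are hypothetical.

```python
import math

def r_n(C, n, eps):
    """r_n(C) = (C + eps) ln n + ln(1 + eps) - C ln eps."""
    return (C + eps) * math.log(n) + math.log(1 + eps) - C * math.log(eps)

def r_n_prime(C, n, eps):
    """r'_n(C) = (C + 1) ln n - (C + 1) ln eps."""
    return (C + 1) * math.log(n) - (C + 1) * math.log(eps)
```

Both functions grow linearly in C for fixed n and $\epsilon$, with slope $\ln n - \ln \epsilon$.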



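On the other hand, if $\ell$ is exp-concave for some $\eta > 0$ and
we let $\eta_t = \eta$ for $t = 1, \ldots, n$ in Algorithm 1, then for any
$n \ge 1$ and $T \in \mathcal{T}_n$ the tracking regret satisfies

$$
\hat{L}_n - L_n(T, a) \le L_{C(T),n}(C(T)+1)\, \rho_{\mathcal{E}}\!\left( \frac{n}{L_{C(T),n}(C(T)+1)} \right)
+ \frac{r_n\bigl( L_{C(T),n}(C(T)+1) - 1 \bigr)}{\eta}
\tag{30}
$$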



while for < e < 1/2 and n > 5, the adaptive regret can be 
bounded as 

K < Lo^nPe + ^"^"^ " . (31) 

\Lo,n/ V 

Proof: First we show the bounds for the tracking regret. 
To prove the theorem, let Tn be defined as in Lemma[Tl and we 
bound the first and last terms on the right-hand side of (|7]i and 
^ (with ui^i in place of ?«„). Note that the conditions on pg 
imply that xp£{y/x) is a nondecreasing function of x for any 
fixed y > (this follows since p£{z)/ z = (^^(z) — 0)/(z — 0) 
is a nonincreasing function of z > by the concavity of p£, 
and hence zp£{l/z) is nondecreasing). Combining this with 
the bounds on C{Tn) in Lemma |4] implies 



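$$
(C(\hat{T}_n) + 1)\, \rho_{\mathcal{E}}\!\left( \frac{n}{C(\hat{T}_n) + 1} \right)
\le L_{C(T),n}(C(T)+1)\, \rho_{\mathcal{E}}\!\left( \frac{n}{L_{C(T),n}(C(T)+1)} \right).
$$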



The last term (1/77,1) ln(l/w^i (f„)) in O and dU can be 
bounded by noting that l/w^i(f„) < l/w;^i(f„) by ^ 
and the latter can be bounded using ( fTSl ): this is given by r„. 
This finishes the proof of the tracking regret bounds. 

Next we prove the bounds for the adaptive regret. Assume 
we want to bound the regret of our algorithm in a segment 
[t,t') with 1 < i < t' < n + 1. By Lemma [3] there exists 
a transition path Tf^i such that it has a switch point at 



t, has at most / 



log(*'-t) 



Llog(9+l)J 

[t, t'), and Wn{Tn) > 0. Let £1,^2, . • . , % 
points of Tn in [t + l,t') where C < I, and let to 



1 < Lo.n segments in 

denote the switch 

t 



and tg^-^ = t' . Notice that, since we are interested in the 
performance of the algorithm only in the interval [i, t'), a 
modified version of Lemma [T] can be applied, where the 
loss is considered only in the interval [t^t') and the weight 
of Tn can be thought to be the sum of the weight of all 
transition paths that agree with T„ in [t, t'). Specifically, letting 
TtA'iTj.^^) = {T e Tt'-i : T and fj-^i agree on [t,t')} and 
wfl,{Tn) = J^TeT I ''^^-i(^)' i*^ ^'^^ be shown similarly to 
Lemma [T] that in the case of a loss function that is convex 
in its first argument and takes values in [0, 1], for any expert 
i e £, 

t'-i 
C + l



t'-i 



In- 



(32) 



t'-ij 



Now —\n.'wfl,{Tti-i) can be bounded in a similar way as 
- In 1 (T„j in H: For t = 1 we can use For < > 2 it 
can be shown, following the proof of ( fTSl l in |3 |, that 



hi 



^fMrt'-i) 



< (C + l)ln(t'- 1)- (C + l)lne 

< (C + l)lnn- (6*+ l)lne (33) 



whenever e < 1/2. Indeed, let denote the event that Hs a 
switch point and let At-^....^tg denote the event that ti, . . . 
are the switch points in [t+1, t'). Since the switch probabilities 
P£i (s|s') ^re independent of s' and l—p£j (s|s') — ^z"-^z\ ^ 
for e < 1/2, we have 

^ J-j- 7I'(tc — 1) / J-j- Zoo — Zt-\ 



t'-l 



^oo — Zs-\ -yi 7r(tc — 1) 



■r~r ^ oo ^ s 1 TT 

^oo - ■^t'-2 TT 7r(tc - 1) 



> 



(t-l)l+^ 



et 



1+e 



(t'-l)^(t-l + e) (t + e)(t-l)i+^ (i'-l)^ 



,C+lil+e 



(t'-l)C+^(t-l + e)(f + e) 



,c+i 



> 



> 



(t' - 1)<^+^ ^1-^ (t' - 1) 



c+l 



where the second inequality follows form inequalities (36) and 
(38) in |3|, and the third follows since (t- 1 + e)(t + e) < . 

It is easy to see that the bound in ( l33T l is larger than ( fTsT i if 
n > 5. Thus, combining with ( |32] | for the maximizing value 
t = 1, t' = ri + 1 and using C* < io.n, we obtain the bound 
( |29] l on the adaptive regret. A modified version of ([32]) (without 
the f/s/8 term) yields dH) ■ 

Remarks: (i) Note that the tracking regret can be trivially
bounded by $(C(T) + 1)$ times the adaptive regret (as suggested
by [11]). However, the direct bounds on the tracking
regret are somewhat better than this: The first term coming
from the adaptive regret bound would be
$L_{0,n}(C(T)+1)\rho_{\mathcal{E}}(n/L_{0,n})$, which is larger than the first term
$L_{C(T),n}(C(T)+1)\rho_{\mathcal{E}}\bigl(n/(L_{C(T),n}(C(T)+1))\bigr)$ in the tracking regret bounds. This justifies
our claim for exp-concave loss functions, since the last
terms will be essentially the same, although the main term in
the bound is not affected. The difference is more pronounced
for the case of the convex and bounded loss function, where
the middle $\eta_t/8$ term becomes multiplied by $(C(T) + 1)$
if the tracking bound is computed from the adaptive regret
bound, resulting in an increased constant factor in the main
term.

(ii) The above theorem provides bounds on the tracking and
adaptive regrets in terms of the regret bound $\rho_{\mathcal{E}}$ of algorithm
A. However, in many practical situations, A behaves much
better than suggested by its regret bound. This behavior is
also preserved in our tracking algorithms: Omitting step (9)
in Lemma 1 we can replace the first term in (28) and (30)
with $L_n(A, \hat{T}_n) - L_n(\hat{T}_n, a)$, which is the actual regret of
algorithm A on the (extended) transition path $\hat{T}_n$. Reordering
the resulting inequality, we can see that the loss of our
algorithm is not much larger than that of A run on $\hat{T}_n$, for
example, in the exp-concave case we have



rn{Lc(T).n{C(T) + l)-l) 



E. Exponential weighting 

We now apply Theorem 2 to the case where A is the
exponentially weighted average forecaster and the set of base
experts is of size N, and discuss the obtained bounds (for
simplicity we assume $C(T) \ge 1$, but $C(T) = 0$ would just
slightly change the presented bounds). In this case, if $\ell$ is
convex and bounded, then by Q the regret of A is bounded
by $\rho_{\mathcal{E}}(n) = \sqrt{n \ln N}$. Setting $\eta_t = \phi \ln n / \sqrt{n}$ for some $\phi > 0$
($\eta_t$ is independent of C(T) but depends on the time horizon
n), the bound (28) becomes, for $g = O(1)$,

Ln - Ln{T, a) 



< JniC{T) + l) 



logn 



2 IniV 



. Llog(,9 + 1)J 



^ V Liog(ff + i)J 



o 



lin 



Furthermore, if an upper bound C on the complexity (number 
of switches) of the meta experts in the reference class is known 
in advance, then -qt can be set as a function of C > C{T) as 

well. Letting 77* = ^8(0 + 1) Inn ( + 2)/^, the 

bound (|28Tl becomes 



Ln - Ln{T, a) 

< Wn(C(T) + l) 



logn 



Llog(5 + 1)J 



2 InA^ 



n(C + l) 



log" 



Llog(g+l)J 



2 lin 



O 



(C + l) Inn 



/ log" 

V Liog(s+i) 



We note that these bounds are of order $(C(T)+1)\sqrt{n \ln^2 n}$,
respectively $\sqrt{(C(T)+1)\, n \ln^2 n}$, only a factor of $O(\sqrt{\ln n})$
larger than the ones of optimal order resulting from earlier
algorithms 0, H, [24] which have complexity $O(n^2)$ (strictly
speaking, the complexity of [4] is $O(nN)$, but, when combined
with efficient algorithms designed for the base-expert class,
only $O(n^2)$ complexity versions are known [24]). In some
applications, such as online quantization [24], the number
of base experts N depends on the time horizon n in a
polynomial fashion, that is, $N \sim n^{\beta}$ for some $\beta > 0$. In
such cases the order of the upper bound is not changed; it
remains $O((C(T)+1)\sqrt{n \ln^2 n})$ if the number of switches
is unknown, and $O(\sqrt{(C(T)+1)\, n \ln^2 n})$ if the maximum
number of switches C(T) is known in advance. This bound is
within a factor of $O(\sqrt{\ln n})$ of the best achievable regret for
this case.

Next we observe that at the price of a slight increase of
computational complexity, regret bounds of the optimal order
can be obtained. Indeed, setting $g = 2n^{\gamma} - 1$ for some $\gamma \in$

(0, 1) and rjt = 4>\J ''^^"'"(7"' ' > independently of the 
maximum number of switches, 

Ln - Ln{T, a) 



< \ln{C(T) + l)lnN [- + 2 



(f) C+l 




i + 2|7ilnn + ofw 

7 J \^ mn 



If rjt is optimized for an a priori known bound C > C{T), 
then we get 

Ln - Ln{T,a) 

< Wn(C(T) + l 



O 




These bounds are of the same $O((C(T)+1)\sqrt{n \ln n})$ and,
respectively, $O(\sqrt{(C+1)\, n \ln n})$ order as the ones achievable
with the quadratic complexity algorithms [21], [24], but the
complexity of our algorithm is only $O(n^{\gamma} \ln n)$ times larger
than that of running A (which is typically linear in n). Thus,
in a sense the complexity of our algorithm can get very close
to linear while guaranteeing a regret of optimal order. (Note
however, that a factor $1/\sqrt{\gamma}$ appears in the regret bounds so
setting $\gamma$ very small comes at a price.)

A similar behavior is observed for exp-concave loss functions.
Indeed, if $\ell$ is exp-concave and A is the exponentially
weighted average forecaster, then by (4) the regret of A is
bounded by $\rho_{\mathcal{E}}(n) = \frac{\ln N}{\eta}$. In this case, for $g = O(1)$, the
bound (30) becomes



< 



L„(T, a) 
{C{T) + 1 



)f^ 



C{T) + 1 



g(9+l)J 



which is a factor of $O(\ln n)$ larger than the existing optimal
bounds of order $O((C(T) + 1) \ln n)$ (see lEJ-El, IS,
lISTl ) valid for algorithms having complexity $O(n^2)$ (again,
concerning lU, we mean its combination with some efficient
algorithm designed for the base-expert class). Note that in this
case the algorithm is strongly sequential as its parametrization
is independent of the time horizon n. For $g = 2n^{\gamma} - 1$, we
obtain a bound of optimal order $O((C(T) + 1) \ln n)$:



Ln - Ln{T, a) 

{C{T) + 1 



< 



(lniV + lnn) + 0(l). 



Bounds of similar order can be obtained for exp-concave
loss functions in the more general case when $\mathcal{E}$ is not of size
N, but is a bounded convex subset of an N-dimensional linear
space. Then $\rho_{\mathcal{E}}(n) = O(\ln n)$ for several algorithms A under
different assumptions. This is the case for exp-concave loss
functions when A performs exponential weighting over all
base experts. Using random-walk based sampling from
log-concave distributions (see [32]), efficient probabilistic
approximations exist to perform this weighting in many cases. Exact
low complexity implementations, such as the Krichevsky-Trofimov
estimate for the logarithmic loss [13] (see Example 1
below), are, however, rare. When additional assumptions are
made, e.g., the gradient of the loss function is bounded, the
online Newton step algorithm of fl^ can be applied to achieve
logarithmic (standard) regret against the base-expert class $\mathcal{E}$.
We refer to [33] for a survey.



F. The weight function $\hat{w}^{KT}$

In this section we analyze the performance of Algorithm 1
for the case when the "Krichevsky-Trofimov" weight function
$\hat{w}^{KT}$ is used. Our analysis is based on part (ii) of Lemma 3
following ideas of Willems and Krom [9] who only considered
the logarithmic loss. Applying the weight function $\hat{w}^{KT}$ (derived
from $w^{KT}$), this analysis improves the constants relative
to Theorem 2 for small values of g, although the resulting
bound has a less compact form. Nevertheless, in some special
situations the bounds can be expressed in a simple form. This
is the case for the logarithmic loss, where, for the special
choice g = 1, applying (24), the new bound now achieves
that of [9] proved for the same algorithm. The idea is that in
the proof of Theorem 2 the concavity of $\rho_{\mathcal{E}}$ was used to get
simple bounds on sums which are sharp if the segments are
of (approximately) equal length. However, in our construction
the length of the sub-segments (corresponding to the same
segment of the original transition path), or more precisely,
their lower bounds, grow exponentially according to (25). This
makes it possible to improve the upper bounds in Theorem 2.
It is interesting to note that the weight functions $w^{\mathcal{L}_1}$ and $w^{\mathcal{L}_2}$
give better bounds for $g = 2n^{\gamma} - 1$, where the segment lengths are
approximately equal, while the large differences in the segment
lengths for $g = O(1)$ can be exploited by the weight function
$w^{KT}$.

To obtain "almost closed-form" regret bounds for a general
$\rho_{\mathcal{E}}$, we need the following technical lemma.

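Lemma 5: Assume $f : [1, \infty) \to (0, \infty)$ is a differentiable
function and $G \ge 1$. Define $F : [1, \infty) \to [0, \infty)$ by

$$ F(s) = \int_0^{\frac{\log s}{G}} f\bigl(s 2^{-cG}\bigr)\, dc $$

for all $s \ge 1$. Then the second derivative of F is given by

$$ F''(s) = \frac{f'(s)}{sG \ln 2} - \frac{f(s)}{s^2 G \ln 2}. $$

Therefore, F is concave on $[1, \infty)$ if $s f'(s) \le f(s)$ for all
$s \ge 1$.

Proof: First note that, since $2^{cG} = s$ for $c = \frac{\log s}{G}$,
Leibniz's integral rule gives

$$
\begin{aligned}
F'(s) &= \frac{f(1)}{sG \ln 2} + \int_0^{\frac{\log s}{G}} f'\bigl(s 2^{-cG}\bigr) 2^{-cG}\, dc \\
&= \frac{f(1) - f(1) + f(s)}{sG \ln 2} = \frac{f(s)}{sG \ln 2}
\end{aligned}
$$

since

$$ \frac{\partial f\bigl(s 2^{-cG}\bigr)}{\partial c} = -f'\bigl(s 2^{-cG}\bigr)\, s 2^{-cG}\, G \ln 2. $$

Differentiating $F'$ gives the desired result. ■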
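
Lemma 5 can be checked numerically for a concrete choice of f. The sketch below is only a sanity check, not part of the analysis: it approximates $F(s)$ with a midpoint rule and compares it, for $f(x) = \sqrt{x}$, with the closed form $2(\sqrt{s} - 1)/(G \ln 2)$ obtained by integrating $F'(s) = f(s)/(sG \ln 2)$ from 1 to s. The function names and the step count are assumptions made for the illustration.

```python
import math

def F(s, f, G, steps=20000):
    """Midpoint-rule approximation of the integral of f(s * 2**(-c*G)) over c in [0, log2(s)/G]."""
    upper = math.log2(s) / G
    width = upper / steps
    return width * sum(f(s * 2.0 ** (-(k + 0.5) * width * G)) for k in range(steps))

def F_closed_sqrt(s, G):
    """Closed form of F for f = sqrt, using F'(s) = f(s) / (s G ln 2) and F(1) = 0."""
    return 2.0 * (math.sqrt(s) - 1.0) / (G * math.log(2))
```

Since $s f'(s) = \sqrt{s}/2 \le \sqrt{s} = f(s)$, the lemma predicts that F is concave in this case, which is visible from the closed form.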

Next we give an improvement of Theorem 2 for small values
of g. For simplicity, the bounds are only given for the tracking
regret. It is much more complicated to obtain sharp bounds for
the adaptive regret, since, similarly to the proof of Theorem 2,
it would require a lower bound on the probability that a new
segment is started at some time instant t, but here the switch
probabilities $p_{KT}(t|t')$, defined in (fTSI l, depend both on t and
$t'$, unlike $p_{\mathcal{L}_1}(t|t')$ which only depends on t.

Theorem 3: Assume $\rho_{\mathcal{E}}(x)$ is differentiable and satisfies
$\rho_{\mathcal{E}}(x) \ge x \rho'_{\mathcal{E}}(x)$ for all $x \ge 1$, and Algorithm 1 is run with
weight function $\{\hat{w}_t^{KT}\}$. Let



(C+1) 



+ 2(C+l)p£ 



Llog(g+l)J 



n 

G+1 



G+1 



.2-cLlog(3+l)J 



and 



fn{C) 

(G + l)ln2 



log' 



c+i 



Liog(g + i)J 



Liog(5 + i)J 



log 



G+1 



+ Llog(.g + 1)J + 8 



If £ is convex in its first argument and takes values in the 
interval [0, 1], and T]t+i < ?/t for i = 1, . . . , n — 1, then for 
any T G Tn the tracking regret satisfies, for all n, 



L,,-L„{T,a) < S{C,n)+Y, 



8 



rn{C) 



(34) 



On the other hand, if i is exp-concave for the value of 77 and 
rjt — rj for t — 1 , . . . , n in Algorithm [T] then for any T E %i 
the tracking regret satisfies 

r„(G) 



i„(T,a) < SiCn) 



Vn 



(35) 



Proof: We proceed similarly to the proof of Theorem 2
by first applying Lemma 1. However, the resulting two terms
are now bounded using Lemma 3 (ii) instead of Jensen's
inequality, which allows us to make use of the potentially
large differences in the segment lengths.

For any transition path $T = (t_1, \ldots, t_C) \in \mathcal{T}_n$ let
$\hat{T} = (\hat{t}_1, \ldots, \hat{t}_{\hat{C}}) \in \mathcal{T}_n$ denote the transition path defined by
Lemma 3 with $\hat{w}^{KT}(\hat{T}) > 0$. The first term of the first upper
bound given in Lemma 1 can be bounded as follows: for any
segment $[t_c, t_{c+1}) = [\hat{t}_{\hat{c}}, \hat{t}_{\hat{c}'})$ of T, Lemma 3 (i) and (23) yield



l — C 

f [log(g + l)J 

< / P£ 
Jo 



tr 



tr 



2cLlog(g+l)J 



dc + 2p£{tc+i - tc). 



Since the right-hand side of the above inequality is a concave 
function of s = tc+i — tc by Lemma |5] and the conditions on 
P£, Jensen's inequality implies 



ii) 



1=0 



C c'-l 



c=0 
C 



■r-^I / Ll°g(9 + l)J 

< El / 



tr 



tr 



2cLlog(g+l)J 



dc+ 2p£{tc+i~ tc) 



n 



/ llog(g+l)J 



2-cLlog(3+l)J 



+ 2{C+l)p£ 



n 



G+1 



(36) 



The weight function can be bounded in a similar way. By 
the standard bound (fT4l l on the ICrichevsky-Trofimov estimate 
lfT4l . we have 

In < In ■ ^ 



w^T) - w^iT) 
c 



(37) 



Applying 

c'-l ,^ 

Y -ln(tVi -i0 + ln2 



for a segment [tc,tc+i) = \tc^tc') of T yields 



< 



Y 

i=0 



c+1 



tr 



-In. 

2 \^ 2'Liog(9+i)J 



In 2 



■ln{tc+i -tc) + ln2 
In 2 r iogftc+i - tc)
2 Llog(.9 + 1)J 

X I l0g(<c+l - tc) - 



log(te+l-*c) 

Llog(g+l)J 



Liog(9 + i)J 



+ -ln(tc+i -tc) +ln2 



< 



\n2 \og'{tc+i-Q 
4 ^ Llog(.9 + l)J 

4 



+ Llog(3 + 1)J + 8 
4 



Llog(.9 + 1)J 



log(tc+l - tc) 



where in the last step we bounded the ceiling function from
above and from below, as appropriate. Furthermore, it is easy
to check that the last expression above is concave in
$s = t_{c+1} - t_c$. Therefore, combining it with (iJTl l and applying Jensen's
inequality, we obtain

In ^ <r„(C7). 



Applying this bound and 
ments of the theorem. 



in Lemma [T] yields the state- 



We now apply Theorem 3 to the exponentially weighted
average predictor. For bounded convex loss functions we
have $\rho_{\mathcal{E}}(n) = \sqrt{n \ln N}$. Assuming $g = O(1)$, if $\eta_t =$

^\/ n[\og{g^-{^ ^^& '1^ > ^i-^" is independent of the 
number of switches C{T)), we obtain 

Ln - Ln{T,a) 



< 2y/{C{T) + l)nln7V 1 + 



c+i 



Llog(,g + l)J In 2 



c+i 



nln2 



^°^"V 2Llog(, + l)J +"^(^+^)^)- 



Optimizing rit as a function of an upper bound C on the 
number of switches yields 



Ln - Ln{T,a) 



< 2^{C{T) + l)n\nN 1 



1 - 



c+i 



Llog(g + l)J In 2 



'(C+ l)?ilog^ 



c+i 



In 2 



8 Llog(5 + 1)J 



+ o{^{C+l)n). 



Note that if $N = O(n^{\beta})$ for some $\beta > 0$, the first term is
asymptotically negligible compared to the second in the above
bounds. For example, if $\eta$ is set independently of C, we obtain




— + o((C + l)V^). 



2 Liog(g + i)J 



On the other hand, if $g = 2n^{\gamma} - 1$, the bound becomes

Ln - Ln{T,a) 



< 2v/(C(r) + l)nlnA^ 1 + 



c+i 



7 Inn 




when T] is set independently of C. 

For exp-concave loss functions we have, for g = 0(1), 



Ln - Ln{T, a) 

< 



log 



C+1 



477 V Llog(3 - 
O(Clnn) 



1)J 



4 In + In 



C+1 



while if $g = 2n^{\gamma} - 1$ we get
Ln - Ln{T,a) 
< 



477 
f 0(C) 



4 I - + 2 lln iV 

7 



4 + 7 + - I In n 

7 



Note that for both types of loss functions we have a clear
improvement relative to Theorem 2, where we used the weight
function $\hat{w}^{\mathcal{L}_1}$, for the case when $g = O(1)$. However, no such
distinction can be made for $g = 2n^{\gamma} - 1$. Indeed, for convex
loss functions constant multiplicative changes in $\eta$ vary the
exact form of the factor $(C + a)/b$, with constants $a, b > 0$ in
the second term, and, consequently, the order of the bounds
depends on the relative size of C, while, for example, the value
of $\gamma$ determines the order of the bounds for exp-concave losses,
e.g., constructing the weight function $\hat{w}$ from $w^{KT}$ is better for
$\gamma > 1/3$. Also note that the above bounds for g = 3 and
g = 4 have improved leading constant compared to [10] and
[11], respectively.

IV. Randomized prediction 

The results of the previous section may be adapted to 
the closely related model of randomized prediction. In this 
framework, the decision maker plays a repeated game against 
an adversary as follows: at each time instant t = l,...,n, 
the decision maker chooses an action It from a finite set, 
say {1,...,A^} and, independently, the adversary assigns 
losses £i,t e [0,1] to each action i = l,...,n. The goal 
of the decision maker is to minimize the cumulative loss 

Ln — J2t=l ^It,t- 

Similarly to the previous section, the decision maker may try 
to compete with the best sequence of actions that can change 
actions a limited number of times. More precisely, the set 
of base experts is £ — {1, . . . ,N} and as before, we may 
define a meta expert that changes base experts C times by 
a transition path T — {ti, . . . ,tc;n) and a vector of actions 
a = {io, ic), where to 1 < < • • • < < tc+i ■= 
n + 1 and ij G {1, . . . , N}. The total loss of the meta expert 
indexed by (T, a), accumulated during n rounds, is 



L„(T, a) = ^ Li^{tc, tc+i) 



c=0 
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with 

tc + l-l 

There are two differences relative to the setup considered 
earlier First, we do not assume that the loss function satisfies 
special properties such as convexity in the first argument 
(although we do require that it be bounded). Second, we do not 
assume in the current setup that the action space is convex, 
and so a convex combination of the experts' advice is not 
possible. On the other hand, similar results as before can be 
achieved if the decision maker may randomize its decisions, 
and in this section we deal with this situation. 

In randomized prediction, before taking an action, the 
decision maker chooses a probability distribution Pj over 
{!,..., (a vector in the probability simplex Aat in M^), 
and chooses an action It distributed according to (condi- 
tionally, given the past actions of the decision maker and the 
losses assigned by the adversary). 

Note that now both $\hat{L}_n$ and $L_n(T,a)$ are random variables 
not only because the decision maker makes randomized decisions but 
also because the losses set by the adversary may depend on 
past randomized choices of the decision maker. (This model 
is known as the "non-oblivious adversary".) We may define 
the expected loss of the decision maker by 

$$\bar{\ell}_t(p_t) = \sum_{i=1}^{N} p_{i,t}\,\ell_{i,t}$$

where $p_{i,t}$ denotes the $i$-th component of $p_t$. 
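The relation between a randomized action and its expected loss can be sketched in Python as follows. The helper names `expected_loss` and `draw_action` are hypothetical, and the losses here are fixed numbers, so the sketch does not model an adversary that reacts to past draws.

```python
import random
from typing import Sequence


def expected_loss(p: Sequence[float], losses: Sequence[float]) -> float:
    """Expected loss sum_i p_i * loss_i of a randomized decision."""
    return sum(p_i * l_i for p_i, l_i in zip(p, losses))


def draw_action(p: Sequence[float], rng: random.Random) -> int:
    """Draw an action in {1, ..., N} distributed according to p."""
    return rng.choices(range(1, len(p) + 1), weights=p, k=1)[0]


# Usage: the average loss over many independent draws approaches the expected loss.
rng = random.Random(0)
p_t = [0.5, 0.25, 0.25]
loss_t = [0.0, 1.0, 0.5]
draws = [loss_t[draw_action(p_t, rng) - 1] for _ in range(10_000)]
print(expected_loss(p_t, loss_t), sum(draws) / len(draws))
```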

For details and discussion of this standard model we refer 
to [1, Section 4.1]. In particular, since the results presented 
in Section II can be extended to time-varying loss functions 
and since $\bar{\ell}_t$ is a linear (convex) function, it can be shown 
that regret bounds of any forecaster in the model of Section 
II can be extended to the sequence of loss functions $\bar{\ell}_t$. That 
is, the bounds can be converted into bounds for the expected 
regret of a randomized forecaster. Furthermore, it is shown 
in [1, Lemma 4.1] how such bounds in expectation can be 
converted to bounds that hold with high probability. 

For example, a straightforward combination of [1, Lemma 
4.1] and Theorem 2 implies the following. Consider a prediction 
algorithm $A$ defined in the model of Section III-A 
that chooses an action in the decision space $\mathcal{D} = \Delta_N$ 
and suppose that it satisfies a regret bound of the form ^ 
under the loss function $\bar{\ell}_t(p_t)$. Algorithm 2 below, which 
is a variant of Algorithm 1, converts $A$ into a forecaster 
under the randomized model. At each time instant $t$, the 
algorithm chooses, in a randomized way, a transition path 
$T = (t_1,\ldots,t_C;t) \in \mathcal{T}_t$, and uses the distribution $p_{A,t}(\tau_t(T))$ 
that $A$ would choose, had it been started at time $\tau_t(T)$, the 
time of the last change in the path $T$ up to time $t$. In the 
definition of the algorithm 

$$\bar{L}_t(A,T) = \sum_{c=0}^{C} \bar{L}_A(t_c,t_{c+1})$$

denotes the cumulative expected loss of algorithm $A$, where 
we define $t_0 = 1$ and $t_{C+1} = t+1$, and 

$$\bar{L}_A(t_c,t_{c+1}) = \sum_{s=t_c}^{t_{c+1}-1} \bar{\ell}_s(p_{A,s}(t_c))$$

is the cumulative expected loss suffered by $A$ in the time 
interval $[t_c,t_{c+1})$ with respect to $\bar{\ell}_s$ for $s \in [t_c,t_{c+1})$. 


Algorithm 2 Randomized tracking algorithm. 

Input: Prediction algorithm $A$, weight function $\{w_t;\ t = 1,\ldots,n\}$, 
learning parameters $\eta_t > 0$, $t = 1,\ldots,n$. 

For $t = 1,\ldots,n$ choose $T \in \mathcal{T}_t$ according to the distribution 

$$\frac{w_t(T)\, e^{-\eta_t \bar{L}_{t-1}(A,T)}}{\sum_{T' \in \mathcal{T}_t} w_t(T')\, e^{-\eta_t \bar{L}_{t-1}(A,T')}},$$

choose $p_t = p_{A,t}(\tau_t(T))$, and pick $I_t \sim p_t$. 
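One round of Algorithm 2 can be sketched by brute force in Python for very small horizons. Everything specific in this sketch is an assumption made for illustration: the base algorithm $A$ is taken to be an exponentially weighted forecaster with fixed rate `eta_a`, restarted at the last change point, and the weight function is a hypothetical independent-switching prior with probability `alpha`. The enumeration over all $2^{t-1}$ transition paths is exponential in $t$, which is exactly the cost the efficient implementation in the paper avoids.

```python
import math
import random
from itertools import combinations


def base_distribution(losses, start, t, eta_a):
    """Distribution used at round t by a Hedge forecaster started at round `start`."""
    n_actions = len(losses[0])
    cum = [sum(losses[s - 1][i] for s in range(start, t)) for i in range(n_actions)]
    weights = [math.exp(-eta_a * c) for c in cum]
    z = sum(weights)
    return [w / z for w in weights]


def path_expected_loss(losses, change_points, t, eta_a):
    """Cumulative expected loss of A along a path over rounds 1, ..., t - 1."""
    bounds = [1, *change_points, t]
    total = 0.0
    for begin, end in zip(bounds, bounds[1:]):
        for s in range(begin, end):
            p = base_distribution(losses, begin, s, eta_a)
            total += sum(p_i * l_i for p_i, l_i in zip(p, losses[s - 1]))
    return total


def path_distribution(losses, t, eta, eta_a, alpha):
    """Normalized weights over all transition paths with change points in 2..t."""
    paths, scores = [], []
    for c in range(t):  # c = number of switches
        for cps in combinations(range(2, t + 1), c):
            # Assumed prior: each of the rounds 2..t is a switch with probability alpha.
            prior = alpha ** c * (1.0 - alpha) ** (t - 1 - c)
            scores.append(prior * math.exp(-eta * path_expected_loss(losses, cps, t, eta_a)))
            paths.append(cps)
    z = sum(scores)
    return paths, [s / z for s in scores]


def randomized_tracking_step(losses, t, eta, eta_a, alpha, rng):
    """Sample a path, form p_t from the last change point, then sample the action I_t."""
    paths, probs = path_distribution(losses, t, eta, eta_a, alpha)
    path = rng.choices(paths, weights=probs, k=1)[0]
    tau = path[-1] if path else 1  # time of the last change up to round t
    p_t = base_distribution(losses, tau, t, eta_a)
    action = rng.choices(range(1, len(p_t) + 1), weights=p_t, k=1)[0]
    return path, p_t, action
```

Only the losses of rounds $1,\ldots,t-1$ are read when the decision for round $t$ is formed, which matches the sequential nature of the algorithm.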



Corollary 1: Suppose $\ell_{i,t} \in [0,1]$ for all $i = 1,\ldots,N$ and 
$t = 1,\ldots,n$, and $A$ satisfies Q with respect to the loss 
function $\{\bar{\ell}_t\}$. Assume Algorithm 2 is run with weight function 
{w^i} for some $\epsilon > 0$. Let $\delta \in (0,1)$. For any $T \in \mathcal{T}_n$, the 
regret of the algorithm satisfies, with probability at least $1 - \delta$, 

Ln - LniT,a) 

r]n \ 2 S ' 

where $r_n(C)$ and $L_{C,n}$ are defined as in Theorem 2. 

Proof: First note that Theorem 2 can easily be extended 
to time-varying loss functions (in fact, Lemma 1, and consequently 
Theorem 2, uses the bound ^ which allows time-varying 
loss functions). Combining the obtained bound for the 
expected loss with [1, Lemma 4.1] proves the corollary. ■ 

V. Examples 

In this section we apply the results of the paper for a few 
specific examples. 

Example 1 (Krichevsky-Trofimov mixtures): Assume $\mathcal{D} = \mathcal{E} = (0,1)$ 
and $\mathcal{Y} = \{0,1\}$, and consider the logarithmic loss 
defined as $\ell(p,y) = -\mathbb{I}_{y=1}\ln p - \mathbb{I}_{y=0}\ln(1-p)$. As mentioned 
before, the logarithmic loss is exp-concave with $\eta \le 1$, and 
hence we choose $\eta = 1$. This loss plays a central role in data 
compression. In particular, if a prediction method achieves, 
on a particular binary sequence $y^n = (y_1,\ldots,y_n)$, a loss $\hat{L}_n$, 
then using arithmetic coding the sequence can be described 
with at most $\hat{L}_n + 2$ bits [34]. We note that the choice of the 
expert class $\mathcal{E} = (0,1)$ corresponds to the situation where the 
sequence $y^n$ is encoded using an i.i.d. coding distribution. 
Competing against the expert class $\mathcal{E} = (0,1)$ also has a 
probabilistic interpretation: it is equivalent to minimizing the 
worst case maximum coding redundancy relative to the class 
of i.i.d. source distributions on $\{0,1\}^n$. 
Let $n_0(t) = \sum_{s=1}^{t} \mathbb{I}_{y_s=0}$ and $n_1(t) = \sum_{s=1}^{t} \mathbb{I}_{y_s=1}$ denote 
the number of 0s and 1s in $y^t$, respectively. Then the loss of 
an expert $\theta \in (0,1)$ at time $t$ is 

$$L_{\theta,t} = -\ln\left((1-\theta)^{n_0(t)}\,\theta^{n_1(t)}\right) = -n_0(t)\ln(1-\theta) - n_1(t)\ln\theta$$

which is the negative log-probability assigned to $y^t$ by a memoryless 
binary Bernoulli source generating 1s with probability 
$\theta$. The Krichevsky-Trofimov forecaster is an exponentially 
weighted average forecaster over all experts $\theta \in \mathcal{E}$ using initial 
weights $1/(\pi\sqrt{\theta(1-\theta)})$ (i.e., the Beta(1/2, 1/2) distribution) 
defined as 



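$$p_t^{KT}(1 \mid y^{t-1}) = \frac{\int_0^1 \theta\, e^{-L_{\theta,t-1}}\, \frac{d\theta}{\pi\sqrt{\theta(1-\theta)}}}{\int_0^1 e^{-L_{\theta,t-1}}\, \frac{d\theta}{\pi\sqrt{\theta(1-\theta)}}} = \frac{\int_0^1 \theta\,(1-\theta)^{n_0(t-1)}\,\theta^{n_1(t-1)}\, \frac{d\theta}{\pi\sqrt{\theta(1-\theta)}}}{\int_0^1 (1-\theta)^{n_0(t-1)}\,\theta^{n_1(t-1)}\, \frac{d\theta}{\pi\sqrt{\theta(1-\theta)}}}.$$

It is well known that $p_t^{KT}$ can be computed efficiently as 

$$p_t^{KT}(0 \mid y^{t-1}) = \frac{n_0(t-1) + 1/2}{t}, \qquad p_t^{KT}(1 \mid y^{t-1}) = 1 - p_t^{KT}(0 \mid y^{t-1}).$$

By [14], the performance of the Krichevsky-Trofimov mixture forecaster 
can be bounded as 

$$R_n \le \frac{1}{2}\ln n + \ln 2.$$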
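The sequential form of the Krichevsky-Trofimov forecaster and its regret against the best Bernoulli expert can be checked numerically with a short Python sketch. The function names are assumptions for this example, losses are measured in nats, and the best expert is taken at the empirical frequency $n_1(n)/n$ (the infimum over $\theta \in (0,1)$ when the sequence is constant).

```python
import math
from typing import Sequence


def kt_log_loss(bits: Sequence[int]) -> float:
    """Cumulative logarithmic loss (in nats) of the sequential KT forecaster."""
    counts = [0, 0]  # counts[0] = number of 0s so far, counts[1] = number of 1s
    total = 0.0
    for t, y in enumerate(bits, start=1):
        p_one = (counts[1] + 0.5) / t  # KT probability that the next bit is 1
        p = p_one if y == 1 else 1.0 - p_one
        total -= math.log(p)
        counts[y] += 1
    return total


def best_expert_log_loss(bits: Sequence[int]) -> float:
    """Infimum over theta in (0, 1) of the Bernoulli expert's cumulative loss."""
    n = len(bits)
    n1 = sum(bits)
    n0 = n - n1
    loss = 0.0
    if n1:
        loss -= n1 * math.log(n1 / n)
    if n0:
        loss -= n0 * math.log(n0 / n)
    return loss


def kt_regret(bits: Sequence[int]) -> float:
    """Regret of the KT forecaster relative to the best Bernoulli expert."""
    return kt_log_loss(bits) - best_expert_log_loss(bits)


# Usage: compare the regret on one sequence with (1/2) ln n + ln 2.
sequence = [0, 1, 1, 0, 1, 1, 1, 0]
print(kt_regret(sequence), 0.5 * math.log(len(sequence)) + math.log(2))
```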

In this framework, a meta expert based on the base expert 
class $\mathcal{E}$ is allowed to change $\theta \in \mathcal{E}$ a certain number of 
times. In the probabilistic interpretation, this corresponds to 
the problem of coding a piecewise i.i.d. source ||2l, Q, Q- 
|j9l- If we apply Algorithm[T]to this problem with , we can 
improve upon Theorem [3] by using 7~„(C) instead of S{C,n) 
in the bound (note that f„(C) was obtained by calculating the 
Krichevsky-Trofimov bound for the transition probabilities), 
and obtain, for any transition path T E Tn and meta expert 
(T,a) 



Ln - Ln{T,a) 
< 2r„(C(T)) 

(C(T) + l)ln2 log^ 



^^^ + 0((C(r) + l)lnn). 



For g 



2 Llog(5 + 1)J 

1, this bound recovers that of [9] (at least in the 
leading term), and improves the leading constant for $g = 3$ 
and $g = 4$ when compared to [10] and [11], respectively. 

On the other hand, for $g = 2n^{\gamma} - 1$, $\gamma > 0$, using with w^^ 
in Algorithm 1, Theorem 3 implies 

3(C(T) 



Ln - Ln{T,a) < 



'i + 2)lnn + 0(l). 



This bound achieves the optimal $O(\ln n)$ order for any $\gamma > 0$, 
albeit with an increased leading constant. On the negative side, 
for specific choices of $\gamma$ our algorithm does not recover the 
best leading constants known in the literature (partly due to the 
common bounding technique for all $\gamma$): If $\gamma = 1/2$, our bound 
is a constant factor worse than those of l?) and fS) which 
have the same $O(n^{3/2})$ complexity (disregarding logarithmic 
factors); on the other hand, in case $\gamma = 1$ our algorithm is 
identical to the $O(n^2)$ complexity algorithm of Shamir and 
Merhav [3], and hence an optimal bound can be proved for 



il)^^ (and for w^^), as done in ||3] achieving Merhav's lower 
bound [30]. 

Example 2 (Tracking structured classes of base experts): 
In recent years a significant body of research has been devoted 
to prediction problems in which the forecaster competes with 
a large but structured class of experts. We refer to |1|, fTSl, 
ifTTl . [|24|, |35|-[38| for an incomplete but representative 
list of papers. A quite general framework that has been 
investigated is the following: a base expert is represented by a 
$d$-dimensional binary vector $v \in \{0,1\}^d$. Let $\mathcal{E} \subset \{0,1\}^d$ be 
the class of experts. The decision space $\mathcal{D}$ is the convex hull of 
$\mathcal{E}$, so the forecaster chooses, at each time instant $t = 1,\ldots,n$, 
a convex combination $p_t = \sum_{v \in \mathcal{E}} \pi_{v,t}\, v \in \mathcal{D} \subset [0,1]^d$. The 
outcome space is $\mathcal{Y} = [0,1]^d$ and if the outcome is $y_t \in \mathcal{Y}$, 
then the loss of expert $v$ is $\ell(v,y_t) = v^\top y_t$, the standard 
inner product of $v$ and $y_t$. The loss of the forecaster 
equals $\ell(p_t,y_t) = \sum_{v \in \mathcal{E}} \pi_{v,t}\, v^\top y_t$. [36] introduces a general 
prediction algorithm, called "Component Hedge," that 
achieves a regret 

$$\sum_{t=1}^{n} \ell(p_t,y_t) - \min_{v \in \mathcal{E}} \sum_{t=1}^{n} \ell(v,y_t) \le d\sqrt{2Kn\ln(d/K)} + dK\ln(d/K)$$

where $K = \max_{v \in \mathcal{E}} \|v\|_1$. What makes Component Hedge 
interesting, apart from its good regret guarantee, is that for 
many interesting classes of base experts it can be calculated in 
time that is polynomial in $d$, even when $\mathcal{E}$ has exponentially 
many experts. We refer to [36] for a list of such examples. 
The results of this paper show that we may obtain efficiently 
computable algorithms for tracking such structured classes of 
base experts. For example, (28) of Theorem 2 applies in this 
case, with $\rho_{\mathcal{E}}(n) = d\sqrt{2Kn\ln(d/K)} + dK\ln(d/K)$. The 
calculations of Section III-E may be modified for this 
case in a straightforward manner. 
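Because the loss in this example is an inner product, the loss of a convex combination of experts equals the inner product of the combined point with the outcome. The Python sketch below checks this on a toy class; the class of all 2-element subsets of 3 components, the coefficient vector and all function names are assumptions chosen for illustration, and the sketch does not implement Component Hedge itself.

```python
from typing import List, Sequence


def inner(u: Sequence[float], v: Sequence[float]) -> float:
    """Standard inner product of two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))


def mixture_point(experts: Sequence[Sequence[int]], coeffs: Sequence[float]) -> List[float]:
    """Convex combination of the expert vectors with the given coefficients."""
    d = len(experts[0])
    return [sum(c * v[j] for c, v in zip(coeffs, experts)) for j in range(d)]


def mixture_loss(experts, coeffs, outcome) -> float:
    """Coefficient-weighted average of the individual expert losses."""
    return sum(c * inner(v, outcome) for c, v in zip(coeffs, experts))


def max_l1_norm(experts: Sequence[Sequence[int]]) -> int:
    """K: the largest number of ones in any expert vector."""
    return max(sum(v) for v in experts)


# Toy class: all subsets of size 2 out of d = 3 components (an assumed example).
experts = [(1, 1, 0), (1, 0, 1), (0, 1, 1)]
coeffs = [0.5, 0.25, 0.25]
outcome = [0.2, 0.4, 0.8]
p_t = mixture_point(experts, coeffs)
print(p_t, inner(p_t, outcome), mixture_loss(experts, coeffs, outcome))
```

The forecaster therefore only needs to maintain the $d$-dimensional point $p_t$, not the coefficients of all experts.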

Example 3 (Tracking the best quantizers): The problem of 
limited-delay adaptive universal lossy source coding of individual 
sequences has recently been investigated in detail 
ESl-lllQl, El, jm-gl]. In the widely used model of fixed- 
rate lossy source coding at rate $R$, an infinite sequence of 
$[0,1]$-valued source symbols $x_1, x_2, \ldots$ is transformed into a 
sequence of channel symbols $y_1, y_2, \ldots$ which take values 
from the finite channel alphabet $\{1,2,\ldots,M\}$, $M = 2^R$, 
and these channel symbols are then used to produce the 
($[0,1]$-valued) reproduction sequence $\hat{x}_1, \hat{x}_2, \ldots$. The quality 
of the reproduction is measured by the average distortion 
$\frac{1}{n}\sum_{t=1}^{n} d(x_t,\hat{x}_t)$, where $d$ is some nonnegative bounded distortion 
measure. The squared error $d(x,x') = (x - x')^2$ is 
perhaps the most popular example. 

The scheme is said to have overall delay at most $\delta$ if there 
exist nonnegative integers $\delta_1$ and $\delta_2$ with $\delta_1 + \delta_2 \le \delta$ such that 
each channel symbol $y_n$ depends only on the source symbols 
$x_1,\ldots,x_{n+\delta_1}$ and the reproduction $\hat{x}_n$ for the source symbol 
$x_n$ depends only on the channel symbols $y_1,\ldots,y_{n+\delta_2}$. When 
$\delta = 0$, the scheme is said to have zero delay. In this case, $y_n$ 
depends only on $x_1,\ldots,x_n$, and $\hat{x}_n$ on $y_1,\ldots,y_n$, so that the 
encoder produces $y_n$ as soon as $x_n$ becomes available, and 
the decoder can produce $\hat{x}_n$ when $y_n$ is received. The natural 
reference class of codes (experts) in this case is the set of 
$M$-level scalar quantizers 

$$\mathcal{Q} = \left\{Q : [0,1] \to \{c_1,\ldots,c_M\},\ \{c_1,\ldots,c_M\} \subset [0,1]\right\}.$$
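One member of this reference class can be sketched in Python as a zero-delay encoder and decoder pair. The nearest-neighbor encoding rule and the function names are assumptions made for this example (the class $\mathcal{Q}$ itself allows any map onto the code points), and the distortion is the squared error averaged over the sequence.

```python
from typing import Sequence


def encode(x: float, codepoints: Sequence[float]) -> int:
    """Channel symbol in {1, ..., M}: index of the nearest code point to x."""
    best = min(range(len(codepoints)), key=lambda j: (x - codepoints[j]) ** 2)
    return best + 1


def decode(y: int, codepoints: Sequence[float]) -> float:
    """Reproduction value for channel symbol y."""
    return codepoints[y - 1]


def average_distortion(source: Sequence[float], codepoints: Sequence[float]) -> float:
    """Average squared error of the zero-delay scheme on a source sequence."""
    total = 0.0
    for x in source:
        y = encode(x, codepoints)       # depends only on the current source symbol
        x_hat = decode(y, codepoints)   # depends only on the current channel symbol
        total += (x - x_hat) ** 2
    return total / len(source)


# Usage: a 2-level quantizer (M = 2) applied to a short source sequence.
print(average_distortion([0.1, 0.4, 0.6, 0.95], [0.25, 0.75]))
```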

The relative loss with respect to the reference class Q is 
known in this context as the distortion redundancy. For the 
squared error distortion, the best randomized coding methods 
II20I . ||39]| . BTll . with linear computational complexity with 
respect to the set Q, yield a distortion redundancy of order 

The problem of competing with the best time-variant quantizer 
that can change the employed quantizer several times 
(i.e., tracking the best quantizer) was analyzed in [24], 
based on a combination of [20] and the tracking algorithm 
of [4]. There the best linear-complexity scheme achieves 
0((C + l)lnn/n^/^) distortion redundancy when an up- 
per bound C on the number of switches in the reference 
class is known in advance. On the other hand, applying our 
scheme with $g = O(1)$ in the method of [24] and using the 
bounds in Section III-E gives a linear-complexity algorithm 
with distortion redundancy 0((C + 1)^/^ hi^/^(n)/ni/'^) + 
0((C + l)/(ln^/2(n)/ni/2)) if C is known in advance and 
only slightly worse 0((C + 1)^/'^ \n^^^{n) /n^/^) + 0{{C + 
1) ln(n)/n^/^) distortion redundancy if C is unknown. When 
g = — 1, the distortion redundancy for linear complexity 
becomes somewhat worse, proportional to 2(2+7) up to 
logarithmic factors. 

VI. Conclusion 

We examined the problem of efficiently tracking large expert 
classes where the goal of the predictor is to perform as 
well as a given reference class. We considered prediction 
strategies that compete with the class of switching strategies 
that can segment a given sequence into several blocks, and 
follow the advice of a different base expert in each block. 
We derived a family of efficient tracking algorithms that, for 
any prediction algorithm A designed for the base class, can 
be implemented with time and space complexity $O(n^{\gamma} \ln n)$ 
times larger than that of $A$, where $n$ is the time horizon and 
$\gamma \ge 0$ is a parameter of the algorithm. With $A$ properly chosen, 
our algorithm achieves a regret bound of optimal order for 
$\gamma > 0$, and only $O(\ln n)$ times larger than the optimal order 
for $\gamma = 0$ for all typical regret bound types we examined. 
For example, for predicting binary sequences with switching 
parameters, our method achieves the optimal $O(\ln n)$ regret 
rate with time complexity $O(n^{1+\gamma} \ln n)$ for any $\gamma \in (0,1)$. 
Linear complexity algorithms that achieve optimal regret rate 
for small base expert classes have been shown to exist in [4] 
and [16]. Our results show that the optimal rate is achievable 
with the slightly larger $O(n^{1+\gamma} \ln n)$, $\gamma > 0$, complexity even 
if the number of switches is not known in advance and the base 
expert class is large. It remains, however, an open question 
whether the optimal rate is achievable with a linear complexity 
algorithm in this case. 